Introduction
Execution nodes are one of the critical components of the Flow network. They are the powerhouse which crunches all transactions and stores the data that the users have in their wallets. They are the essential piece of the puzzle to scaling transaction throughput and the amount of data stored on-chain.
All data that is updated through an execution of a Cadence smart contract is called “Execution data” and is stored in a Memory-trie (MTrie) data structure. MTrie serves 2 purposes - it holds all execution state data and also calculates a new hash every time the data changes. This hash is used by verification nodes to validate the execution result.
We chose the simple solution for execution state storage, which was to store the whole state in memory. It enables fast execution and we knew we have a lot of space to grow before we need to implement more scalable solution.
As of July 2023, FLOW is running at ~10% transaction throughput capacity (at the average of ~650K daily transactions in 2023 so far), so we still have room to grow.
Execution state growth has been a bigger challenge for us over the last year. Despite continuously optimizing the execution state storage, we had to increase memory on execution nodes this year to keep up with the growing execution state (the average memory used is hovering around 700GB of RAM) and we are again getting close to 90% of memory used during peak loads.
The purpose of this article is to explain the challenges we face with the execution node, our priorities and approach in solving those challenges. The planned changes described below will not require any changes in smart contracts or dapps running on flow.
Solving the problem
We are working on 2 initiatives in parallel to address the challenges with growing size of the execution state - Optimizing the in-memory execution state data structure (“Atree register inlining” project) and execution state storage on disk (”Storehouse” project).
Atree Register Inlining
The primary goal of the Atree Register Inlining project is to reduce the memory usage of Execution Nodes, while also reducing usage of storage, network and CPU.
Secondary goal is to slow down the growth rate of memory and storage required by Execution Nodes.
Atree is used by Cadence execution runtime to manage and store data in units called payloads. Each payload has direct and indirect overhead which consumes resources (memory, storage, network and CPU). For example, each payload requires extra set of data to be created and stored (including its own SHA3 hash).
The redesigned version of Atree will merge small payloads, which will reduce total number of payloads by over 50%. This will result in reduced rate of new payload creation, and fewer payloads transmitted across the network. Although reducing CPU usage isn’t a priority, reducing the total number of SHA3 hashes can offset some or possibly all of the computation needed for Atree register inlining.
To summarize the benefits, reducing the number of payloads and overall execution state size will:
- reduce RAM, storage, and network traffic.
- reduce CPU used by garbage collector and SHA3 hashes for inlined payloads.
- also help other projects using Atree payloads (e.g. StoreHouse) use less resources for database indexes and caches.
Preliminary estimates indicate over 15% reduction of the memory used by the execution state, just from eliminating direct overhead associated with inlined payloads. Additional memory savings from reducing indirect overhead, as well as the ripple effect on other components will likely result in additional savings in resource usage, on top of the direct savings.
Our plan is to deploy the updated Atree in Q4 this year. The deployment requires a data migration, so this feature can only be deployed in a spork.
If you are interested in more detail, early description of the projects is here, note that the scope evolved since and this issue does not cover all aspects of the planned changes.
You can also follow the progress of the redesigned Atree implementation by following the PRs linked to this issue: Optimize atree to reduce Flow's mtrie size and reduce number of reads · Issue #292 · onflow/atree · GitHub
Storehouse
Goal of the Storehouse project is to design a new storage layer for the execution state data using Pebble DB, which will enable storage on disk.
This means that we will separate the data payloads that are currently held in-memory in the MTrie data structure and store them in a key-value store on disk.
The MTrie will continue to serve it’s purpose as a data structure to generate proofs for the execution state updates and will for the time being remain in memory.
By storing payloads on disk, we can scale the execution state to hundreds of TBs on a single machine. It will also address the state growth rate issue, because the MTrie data structure grows at a much slower rate than the payloads themselves. The increased storage capacity, combined with slower growth rate of in-memory data, provides space for the execution state to Grow to over 100x of it’s current size.
We have made good progress on the initial draft of the Storehouse design and will share it with the community soon. We are planning to complete the POC in early Q4 and estimate the full solution to be ready for testing by the end of this year. Once deployed, we expect to reduce the memory usage of the Execution node by ~50%. The delivery of Storehouse project is the first step on the path that leads to Flow supporting Petabytes of on-chain storage in a single state.