Scaling the Execution Node

Introduction

Execution nodes are among the most critical components of the Flow network. They are the powerhouse that executes every transaction and stores the data users hold in their wallets. They are an essential piece of the puzzle in scaling transaction throughput and the amount of data stored on-chain.

All data updated through the execution of a Cadence smart contract is called “execution data” and is stored in a memory-trie (MTrie) data structure. The MTrie serves two purposes: it holds all execution state data, and it computes a new hash every time the data changes. Verification nodes use this hash to validate the execution result.
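
As a rough sketch of how this works (illustrative Go, not Flow’s actual MTrie implementation; SHA-256 stands in for SHA3 to keep it stdlib-only): updating any payload changes hashes up to the root, and the root hash is the state commitment that verification nodes check.

```go
package main

import (
	"crypto/sha256" // stand-in for SHA3 to keep the sketch stdlib-only
	"fmt"
)

type node struct {
	left, right *node
	payload     []byte // leaf data (a register payload)
	hash        [32]byte
}

// rehash recomputes a node's hash from its payload (leaf) or its children.
func (n *node) rehash() {
	h := sha256.New()
	if n.left == nil && n.right == nil {
		h.Write(n.payload)
	} else {
		h.Write(n.left.hash[:])
		h.Write(n.right.hash[:])
	}
	copy(n.hash[:], h.Sum(nil))
}

func main() {
	a := &node{payload: []byte("balance=10")}
	b := &node{payload: []byte("nft=moment-1")}
	a.rehash()
	b.rehash()
	root := &node{left: a, right: b}
	root.rehash()
	fmt.Printf("root hash: %x\n", root.hash)

	// Any payload update changes the root hash -- this is what
	// verification nodes check against the execution result.
	a.payload = []byte("balance=7")
	a.rehash()
	root.rehash()
	fmt.Printf("root hash after update: %x\n", root.hash)
}
```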

We initially chose the simple solution for execution state storage: keeping the whole state in memory. It enables fast execution, and we knew we had plenty of room to grow before needing a more scalable solution.

As of July 2023, Flow is running at ~10% of its transaction throughput capacity (averaging ~650K daily transactions in 2023 so far), so we still have room to grow.
Execution state growth has been a bigger challenge for us over the last year. Despite continuously optimizing the execution state storage, we had to increase the memory on execution nodes this year to keep up with the growing state (average usage is hovering around 700 GB of RAM), and we are again approaching 90% memory utilization during peak loads.

The purpose of this article is to explain the challenges we face with the execution node, and our priorities and approach in solving them. The planned changes described below will not require any changes to smart contracts or dapps running on Flow.

Solving the problem

We are working on two initiatives in parallel to address the growing size of the execution state: optimizing the in-memory execution state data structure (the “Atree register inlining” project) and moving execution state storage to disk (the “Storehouse” project).

Atree Register Inlining

The primary goal of the Atree register inlining project is to reduce the memory usage of execution nodes, while also reducing storage, network, and CPU usage.
A secondary goal is to slow the growth rate of the memory and storage required by execution nodes.

Atree is used by the Cadence execution runtime to manage and store data in units called payloads. Each payload carries direct and indirect overhead that consumes resources (memory, storage, network, and CPU). For example, each payload requires an extra set of data to be created and stored, including its own SHA3 hash.
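
As a rough illustration of where the direct overhead comes from (hypothetical types, not Atree’s real encoding):

```go
package sketch

// Payload is a hypothetical illustration of per-payload overhead; Atree's
// real encoding differs, but the shape of the cost is the same.
type Payload struct {
	Key   []byte   // register/slab identifier
	Value []byte   // the actual Cadence data, often just a few bytes
	Hash  [32]byte // per-payload SHA3 digest: fixed overhead paid for every payload, however small
}
```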

The redesigned version of Atree will merge small payloads into their parents, reducing the total number of payloads by over 50%. This lowers the rate of new payload creation and means fewer payloads are transmitted across the network. Although reducing CPU usage isn’t a priority, computing fewer SHA3 hashes can offset some, or possibly all, of the computation needed for register inlining.

To summarize the benefits, reducing the number of payloads and overall execution state size will:

  • reduce RAM, storage, and network traffic.
  • reduce the CPU used by the garbage collector and by SHA3 hashing, since inlined payloads no longer need their own hashes.
  • help other projects using Atree payloads (e.g. Storehouse) use fewer resources for database indexes and caches.

Preliminary estimates indicate an over-15% reduction in the memory used by the execution state, just from eliminating the direct overhead of inlined payloads. Reducing indirect overhead, together with the ripple effect on other components, will likely yield further savings on top of that.

Our plan is to deploy the updated Atree in Q4 this year. The deployment requires a data migration, so this feature can only be rolled out in a spork.

:bulb: If you are interested in more detail, an early description of the projects is here; note that the scope has evolved since, and this issue does not cover all aspects of the planned changes.
You can also track the progress of the redesigned Atree implementation via the PRs linked to this issue: Optimize atree to reduce Flow's mtrie size and reduce number of reads · Issue #292 · onflow/atree · GitHub

Storehouse

The goal of the Storehouse project is to design a new storage layer for execution state data using Pebble DB, enabling storage on disk.

This means we will separate the data payloads that are currently held in memory in the MTrie data structure and store them in a key-value store on disk.
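
To make the idea concrete, here is a minimal sketch using Pebble’s Go API; the key scheme and values are made up for illustration and are not the actual Storehouse design:

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Open (or create) an on-disk key-value store for register payloads.
	db, err := pebble.Open("execution-state", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hypothetical register key: owner address + storage path.
	regID := []byte("0x1654653399040a61/storage/flowTokenVault")

	// Writes go to disk instead of growing the in-memory state.
	if err := db.Set(regID, []byte("balance=42.0"), pebble.Sync); err != nil {
		log.Fatal(err)
	}

	// Transaction execution reads payloads back on demand.
	val, closer, err := db.Get(regID)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("payload: %s\n", val)
	closer.Close()
}
```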

The MTrie will continue to serve its purpose as the data structure that generates proofs for execution state updates, and will for the time being remain in memory.

By storing payloads on disk, we can scale the execution state to hundreds of terabytes on a single machine. It also addresses the state growth rate issue, because the MTrie data structure grows at a much slower rate than the payloads themselves. The increased storage capacity, combined with the slower growth rate of in-memory data, provides space for the execution state to grow to over 100x its current size.

We have made good progress on the initial draft of the Storehouse design and will share it with the community soon. We are planning to complete the proof of concept in early Q4 and estimate the full solution will be ready for testing by the end of this year. Once deployed, we expect it to reduce the memory usage of the execution node by ~50%. Delivering the Storehouse project is the first step on the path to Flow supporting petabytes of on-chain storage in a single state.

This is a very nice improvement; I have a few questions.

  • With register inlining, is there a plan to increase the slab size?
  • Is there any work on improving proof sizes?
  • Is the ~50% saving coming from Storehouse only, or is it combined with Atree inlining?
  • What will be the average transaction cost increase (if any), in percentage? Is there any plan to increase the gas limit?

The increased storage capacity, combined with the slower growth rate of in-memory data, provides space for the execution state to grow to over 100x its current size.

  • What is the size of the MTrie in memory currently? And what is the expected size afterwards?

With register inlining, is there a plan to increase the slab size?

Need to ask the experts, stay tuned :slight_smile:

Is there any work on improving proof sizes?

What do you mean by improving proof sizes?

Is the ~50% saving coming from Storehouse only, or is it combined with Atree inlining?

We estimate Atree register inlining will reduce memory usage by ~15%, maybe a bit more. Storehouse will build on this and reduce memory usage by an additional ~50%, as the memory used by payloads and by the trie nodes is currently split about 50/50.
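
To put rough numbers on it: starting from ~700 GB in memory split ~50/50 between payloads and trie nodes, inlining takes it to roughly 700 GB × 0.85 ≈ 595 GB, and moving the payload half to disk then removes a further ~50%, leaving on the order of ~300 GB in memory. (Back-of-the-envelope arithmetic based on the figures above.)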

What will be the average transaction cost increase (if any), in percentage? Is there any plan to increase the gas limit?

AFAIK there is no plan to increase transaction fees as a result of these two projects.

With register inlining, is there a plan to increase the slab size?

We don’t have any active plan to increase the slab size right away. To clarify: what we call register inlining here is dynamic, meaning a register is only inlined if it is small enough and its parent has enough room, i.e. not all child maps are going to be inlined. Last time I checked, this work facilitates inlining a lot of small registers whose parents have room to spare (e.g. an NBA Top Shot moment). Once this work is ready, we will run experiments on mainnet data to see how much of a slab capacity problem it might cause (if any significant), and if increasing the slab size would solve it, we could certainly plan that work.
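
A hypothetical sketch of that rule (made-up names and thresholds, not Atree’s actual code):

```go
package sketch

const (
	maxInlineSize = 64   // assumed threshold for a "small" register
	slabCapacity  = 1024 // assumed slab capacity in bytes
)

// canInline mirrors the dynamic rule described above: a child register
// is inlined only if it is small enough AND its parent slab still has
// room for it; otherwise it remains a standalone payload.
func canInline(parentUsedBytes, childSizeBytes int) bool {
	return childSizeBytes <= maxInlineSize &&
		parentUsedBytes+childSizeBytes <= slabCapacity
}
```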

Is there any work on improving proof sizes?

If you mean the proof size for register inclusion: we also have some other work planned for the trie storage after separating the payload storage (not covered here yet) that would significantly reduce the proof size for a batch of registers, especially those stored under one account. Stay tuned as we share more updates about it.

Hi Zizipus, the execution node executes all transaction code, yes, but it is definitely not the only node performing operations from the protocol’s point of view. We have collection, consensus, and verification nodes performing many operations to produce, validate, and seal blocks. Having specialized nodes gives us the ability to scale different operations independently; this is what makes Flow very efficient! It also means nodes have different hardware and scalability requirements in general; the execution node is the heaviest node in the network in terms of compute and storage requirements.
You can read an intro to the Flow protocol architecture here if it helps: https://medium.com/@jan-bernatik/introduction-to-flow-blockchain-7532977c8af8. I also really recommend this blog post from Packy McCormick; he explains the scalability approach Flow is taking really well: Flow: The Normie Blockchain.

We’ve been working on the design of Storehouse for a while and have discussed the details over a few rounds. Now I’m excited to share the design doc, which summarizes our thoughts on design principles, module functionality, special cases to be aware of, and the technical tradeoffs and decisions.