Scaling Node Operations at Coinbase

TL;DR: This post shares insights on how Coinbase is investing in new tools and processes to scale its node operations.

By Min Choi, Senior Engineering Manager — Crypto Reliability

Blockchain nodes power almost every user experience at Coinbase. We use them to monitor fund movements, help our customers earn their staking rewards, and build the analytics needed to support popular features within our applications. As such, being able to effectively manage blockchain nodes is vital to our core business and we are continuing to invest in ways to scale our node operations.

One of the most difficult aspects of node management is keeping up with the constant, and sometimes unpredictable, changes to the node software. Asset developers are constantly releasing new code versions, and some blockchains, such as Tezos, use an on-chain governance model in which the community votes on all proposed changes. A decentralized governance model such as this makes it difficult to predict when a change will be introduced and to prepare our internal systems in advance. An example of such a scenario is depicted in the Messari alert below.

Data provided by https://messari.io/

The consequences of not keeping up with these changes can be severe for our customers: a missed upgrade could cause long delays to balance updates in our core wallets or slashed staking rewards. To help prevent these incidents, we’re focusing our investments on the following areas:

Asset Release Manager

This service gives us an extra pair of hands (or should I say “ARM”) to process common node upgrades. All puns aside, the ARM service monitors GitHub release activity for dozens of critical blockchains and automates the deployment of new node binaries to our non-production environments. This frees up our engineers to focus on service validations and to work proactively with asset developers to resolve problems prior to production release.
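
As a rough illustration of the release-monitoring step, the following Python sketch compares the newest published release tag for each asset against the version currently deployed. It is not the ARM implementation: the names `parse_version`, `UpgradeCandidate`, and `find_upgrades` are hypothetical, the release tags are assumed to have already been fetched from a release feed, and the tags are assumed to look like `v3.6.2` or `v3.6.2-stable`.

```python
from __future__ import annotations

from dataclasses import dataclass


def parse_version(tag: str) -> tuple[int, ...]:
    """Turn a tag such as 'v3.6.2' or 'v3.6.2-stable' into (3, 6, 2)."""
    numeric = tag.lstrip("v").split("-")[0]  # drop leading 'v' and any suffix
    return tuple(int(part) for part in numeric.split("."))


@dataclass(frozen=True)
class UpgradeCandidate:
    """Hypothetical record describing one pending node upgrade."""
    asset: str
    current: str
    target: str


def find_upgrades(
    deployed: dict[str, str], released: dict[str, list[str]]
) -> list[UpgradeCandidate]:
    """Return one candidate per asset whose newest release is ahead of what is deployed."""
    candidates = []
    for asset, current in deployed.items():
        tags = released.get(asset, [])
        if not tags:
            continue  # nothing published for this asset
        newest = max(tags, key=parse_version)
        if parse_version(newest) > parse_version(current):
            candidates.append(UpgradeCandidate(asset, current, newest))
    return candidates
```

Comparing parsed tuples rather than raw strings matters here, because a plain string comparison would rank `3.9.9` above `3.10.0`.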

The diagram below shows the high-level data flow for ARM.

Here’s a recent example of how the ARM service was leveraged to process a node upgrade for Algorand.

  • On May 9 at 12:44 PM PDT, Algorand version 3.6.2 was released.
  • On May 9 at 1:13 PM PDT, the ARM service filed a ticket to notify our engineers and track the incoming change.
  • On May 9 at 1:43 PM PDT, the required code change was automatically generated for build and deployment.
  • On May 9 at 2:13 PM PDT, the change was automatically deployed to all our non-production environments for Algorand.
  • On May 9 at 2:43 PM PDT, an error in one of the three deployments was detected and the ARM service escalated to an engineer to help investigate.
  • On May 10 at 6:27 AM PDT, the engineer resolved the deployment problem and began service validation testing in preparation for production deployment.

As this chronology shows, the system isn’t completely touchless: engineers are still needed as part of the overall upgrade process. However, the ARM service allows us to process hundreds of these upgrade operations in parallel, saving engineering hours that can then be reinvested into quality assurance efforts.
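
The pattern in this chronology (deploy to several environments at once, then hand any failure to an engineer) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_deployments` and `needs_escalation` are hypothetical names, and the `deploy` callable stands in for whatever actually rolls out a node binary.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_deployments(
    environments: list[str], deploy: Callable[[str], None]
) -> dict[str, str | None]:
    """Deploy to every environment concurrently.

    Maps each environment name to None on success or to an error message.
    """

    def attempt(env: str) -> str | None:
        try:
            deploy(env)
            return None
        except Exception as exc:  # any failure is handed to a human
            return str(exc)

    with ThreadPoolExecutor(max_workers=8) as pool:
        # pool.map preserves input order, so outcomes line up with environments
        outcomes = list(pool.map(attempt, environments))
    return dict(zip(environments, outcomes))


def needs_escalation(results: dict[str, str | None]) -> list[str]:
    """Return the environments whose deployment failed, sorted by name."""
    return sorted(env for env, error in results.items() if error is not None)
```

Catching the failure inside each worker keeps one broken environment from aborting the others, which mirrors the case above where one of three deployments failed and only that one was escalated.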

Test-Runner

This is an orchestration service used to execute integration tests, both via Temporal workflows and via API calls to critical systems across Coinbase. As the name suggests, Test-Runner obtains and stores test results, aggregates them by metadata, and exposes an API to query the results. By making it simple to create these tests and share standardized test results across our engineering teams, we’re able to accelerate our asset addition and incident response processes. We place a lot of value on building reusable integration tests, as we view them as a foundation of our asset maintenance regime.
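
To make the "store, aggregate by metadata, query" idea concrete, here is a minimal in-memory sketch in Python. It is not Test-Runner's actual design or API: `IntegrationResult`, `ResultStore`, and the metadata keys used with them are hypothetical, and a real service would persist results and expose them over a network API rather than method calls.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class IntegrationResult:
    """One test outcome plus free-form metadata (for example asset or version)."""
    name: str
    passed: bool
    metadata: dict[str, str] = field(default_factory=dict)


class ResultStore:
    """Hypothetical in-memory stand-in for a results database."""

    def __init__(self) -> None:
        self._results: list[IntegrationResult] = []

    def record(self, result: IntegrationResult) -> None:
        self._results.append(result)

    def query(self, **filters: str) -> list[IntegrationResult]:
        """Return results whose metadata matches every given key/value pair."""
        return [
            r for r in self._results
            if all(r.metadata.get(k) == v for k, v in filters.items())
        ]

    def summarize(self, key: str) -> dict[str, dict[str, int]]:
        """Count passed and failed results grouped by one metadata key."""
        summary: dict[str, dict[str, int]] = defaultdict(
            lambda: {"passed": 0, "failed": 0}
        )
        for r in self._results:
            bucket = summary[r.metadata.get(key, "unknown")]
            bucket["passed" if r.passed else "failed"] += 1
        return dict(summary)
```

Keeping metadata as plain key/value pairs means a new dimension, such as environment or node version, can be added without changing the store itself.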

The diagram below shows the high-level service architecture for Test-Runner.

Here are a few basic examples of the types of tests that are in scope for Test-Runner.

  1. Balance transfers within Coinbase.
  2. Deposits and withdrawals in and out of Coinbase.
  3. Sweep and restore operations between cold and hot wallets.
  4. Simple trade operations (buy/sell).
  5. Rosetta validation.

Each time a node is upgraded, these tests are automatically triggered through our continuous integration (CI) pipeline, providing a clear validation of success or failure. This helps our engineers make quick and informed operational decisions such as rolling back to a previous version of the node binary.
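
One way to picture how test outcomes could feed a rollback decision is the small Python sketch below. It is an assumption-laden illustration, not Coinbase's pipeline: `Recommendation` and `recommend` are hypothetical names, the version strings are placeholders, and the result is only a recommendation for an engineer to act on.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical summary handed to the engineer on call."""
    action: str  # "promote" or "rollback"
    version: str  # node version that should be running afterwards
    failed_tests: tuple[str, ...]


def recommend(
    test_outcomes: dict[str, bool], new_version: str, previous_version: str
) -> Recommendation:
    """Suggest keeping the new node version only if every test passed."""
    failed = tuple(sorted(name for name, ok in test_outcomes.items() if not ok))
    # Assumption: an upgrade with no test results is treated as unvalidated.
    if failed or not test_outcomes:
        return Recommendation("rollback", previous_version, failed)
    return Recommendation("promote", new_version, failed)
```

Treating an empty result set as a rollback is a deliberately conservative choice in this sketch: a missing validation is handled the same way as a failed one.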

Blockchain Pods

As we add more blockchains to our support catalog, we’re investing in flexible engineering teams designed to collaborate on emerging priorities. Our pods are approximately 5–7 engineers in size, are made up of site reliability and software engineers, and offer opportunities to quickly adapt to shifting market conditions. For example, we most recently formed a pod to focus specifically on Ethereum’s upcoming transition from a Proof-of-Work (PoW) to a Proof-of-Stake (PoS) blockchain. The Merge is a very large and extremely complex change, requiring nearly all Coinbase systems to adjust, but it is also a one-time event that doesn’t justify the formation of a permanent engineering team.

We’re also in the process of forming new pods to focus on ERC-20 (Tokens) and ERC-721 (NFTs). In this way, we can quickly shift toward developing features that harness these standards for the benefit of our customers. By constantly forming and dissolving pods in this manner, we’re able to develop small economies of scale that quickly meet our customers’ needs. It also gives our engineers the flexibility to choose between areas of technological interest and build subject matter expertise that helps them grow their careers at Coinbase.

Final Thoughts

Developing a comprehensive strategy for node management is a challenging endeavor. While we acknowledge that our own strategy is not without flaws, we take pride in operating at the cutting edge of blockchain technology. Every day, Coinbase engineers work tirelessly in partnership with the greater crypto community to overcome these operational challenges. So if you’re interested in building the financial system of the future, check out the openings on the Crypto Reliability (CREL) team at Coinbase.


Scaling Node Operations at Coinbase was originally published in The Coinbase Blog on Medium.


