Rearchitecting apps for scale

How Coinbase is using Relay and GraphQL to enable hypergrowth

By Chris Erickson, Terence Bezman

A little over a year ago, Coinbase completed the migration of our primary mobile application to React Native. During the migration, we realized that our existing approach to data (REST endpoints and a homebuilt REST data fetching library) was not going to keep up with the hypergrowth that we were experiencing as a company.

“Hypergrowth” is an overused buzzword, so let’s clarify what we mean in this context. In the 12 months after we migrated to the React Native app, our API traffic grew by 10x and we increased the number of supported assets by 5x. In the same timeframe, the number of monthly contributors on our core apps tripled to ~300. With these additions came a corresponding increase in new features and experiments, and we don’t see this growth slowing down any time soon (we’re looking to hire another 2,000 across Product, Engineering, and Design this year alone).

To manage this growth, we decided to migrate our applications to GraphQL and Relay. This shift has enabled us to holistically solve some of the biggest challenges that we were facing related to API evolution, nested pagination, and application architecture.

API evolution

GraphQL was initially proposed as an approach to help with API evolution and request aggregation.

Previously, in order to limit concurrent requests, we would create various endpoints to aggregate data for a particular view (e.g., the Dashboard). However, as features changed, these endpoints kept growing and fields that were no longer used could not safely be removed — as it was impossible to know if an old client was still using them.

By the end, we were limited by an inefficient system, as a few anecdotes illustrate:

  1. An existing web dashboard endpoint was repurposed for a new home screen. This endpoint was responsible for 14% of our total backend load. Unfortunately, the new dashboard was only using this endpoint for a single, boolean field.
  2. Our user endpoint had become so bloated that it was a nearly 8MB response — but no client actually needed all of this data.
  3. The mobile app had to make 25 parallel API calls on startup, but at the time React Native was limiting us to 4 parallel calls, causing an unavoidable waterfall.
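
The third anecdote can be made concrete with a rough model. The sketch below assumes every call takes about the same time and that the 4-call limit is strict; neither assumption comes from the original measurements, and the function name is illustrative.

```typescript
// Rough model: how many sequential rounds are needed when only
// `maxParallel` requests may be in flight at once. It assumes every call
// takes about the same time, which real traffic will not.
function sequentialRounds(totalCalls: number, maxParallel: number): number {
  if (totalCalls < 0 || maxParallel < 1) {
    throw new RangeError("totalCalls must be >= 0 and maxParallel >= 1");
  }
  return Math.ceil(totalCalls / maxParallel);
}

// 25 startup calls with at most 4 in flight: ceil(25 / 4) = 7 rounds.
const restStartupRounds = sequentialRounds(25, 4);

// One aggregated request fits in a single round.
const singleQueryRounds = sequentialRounds(1, 4);
```

Under this model the startup path costs roughly seven round trips instead of one, before any of the responses are even parsed.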

Each of these could be solved in isolation using various techniques (better process, API versioning, etc.), but these are challenging to implement while the company is growing at such a rapid rate.

Luckily, this is exactly what GraphQL was created for. With GraphQL, the client can make a single request, fetching only the data it needs for the view it is showing. (In fact, with Relay we can require that clients request only the data they need; more on that later.) This leads to faster requests, reduced network traffic, lower load on our backend services, and an overall faster application.
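
As a minimal sketch of "fetching only the data it needs", the toy function below projects a large record down to the fields a view selects. It is a model of the idea, not a GraphQL implementation, and the record shape and field names (such as isEligibleForRewards) are hypothetical.

```typescript
// Toy model of a selection set: `true` selects a scalar field, a nested
// object selects fields of a nested object. Not a GraphQL implementation.
type Selection = { [field: string]: true | Selection };
type JsonObject = { [key: string]: unknown };

function select(source: JsonObject, selection: Selection): JsonObject {
  const result: JsonObject = {};
  for (const [field, sub] of Object.entries(selection)) {
    const value = source[field];
    if (sub === true) {
      result[field] = value;
    } else if (typeof value === "object" && value !== null && !Array.isArray(value)) {
      result[field] = select(value as JsonObject, sub);
    }
    // A nested selection on a non-object value is simply omitted.
  }
  return result;
}

// Hypothetical bloated user record, standing in for a large REST response.
const fullUserRecord: JsonObject = {
  id: "user-1",
  displayName: "Ada",
  isEligibleForRewards: true,
  preferences: { theme: "dark", currency: "USD" },
  loginHistory: ["2022-01-01", "2022-01-02"],
};

// The home screen needs a single boolean, so that is all it selects.
const homeScreenSelection: Selection = { isEligibleForRewards: true };
const homeScreenData = select(fullUserRecord, homeScreenSelection);
```

Here homeScreenData contains only the one selected field, and a server that sees the selection up front never has to load the rest.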

Nested pagination

When Coinbase supported 5 assets, the application was able to make a couple of requests: one to get the assets (5), and another to get the wallet addresses (up to 10) for those assets, and stitch them together on the client. However, this model doesn’t work well when a dataset gets large enough to need pagination. Either you have an unacceptably large page size (which reduces your API performance), or you are left with cumbersome APIs and waterfalling requests.

If you’re not familiar, a waterfall in this context happens when the client has to first ask for a page of assets (give me the first 10 supported assets), and then has to ask for the wallets for those assets (give me wallets for ‘BTC’, ‘ETH’, ‘LTC’, ‘DOGE’, ‘SOL’, …). Because the second request is dependent on the first, it creates a request waterfall. When these dependent requests are made from the client, their combined latency can lead to terrible performance.

This is another problem that GraphQL solves: it allows related data to be nested in the request, moving this waterfall to the backend server that can combine these requests with much lower latency.
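
A sketch of what such a nested request might look like, with a rough latency comparison. The query shape (field names, arguments, and the edges/node connection style) and the latency numbers are assumptions for illustration, not Coinbase's schema or measurements.

```typescript
// Hypothetical nested query: one request returns a page of assets together
// with a page of wallets for each asset.
const assetsWithWalletsQuery = `
  query AssetsWithWallets {
    assets(first: 10) {
      edges {
        node {
          symbol
          wallets(first: 10) {
            edges { node { address } }
          }
        }
      }
    }
  }
`;

// Rough latency model. Both numbers used below are made up.
interface Latency {
  clientToServerMs: number; // round trip between the app and the API
  serverToServiceMs: number; // round trip between the API and a backend service
}

// REST waterfall: the client makes two dependent round trips.
function clientWaterfallMs(latency: Latency): number {
  return 2 * (latency.clientToServerMs + latency.serverToServiceMs);
}

// Nested query: one client round trip; the two dependent lookups still
// happen, but between servers.
function nestedQueryMs(latency: Latency): number {
  return latency.clientToServerMs + 2 * latency.serverToServiceMs;
}

const mobileConnection: Latency = { clientToServerMs: 200, serverToServiceMs: 10 };
const waterfallTotal = clientWaterfallMs(mobileConnection); // 420
const nestedTotal = nestedQueryMs(mobileConnection); // 220
```

The dependency between the two lookups does not disappear; it moves to the side of the network where a round trip is cheap.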

Application architecture

We chose Relay as our GraphQL client library, and it has delivered a number of unexpected benefits. The migration has been challenging in that evolving our code to follow idiomatic Relay practices has taken longer than expected. However, the benefits of Relay (colocation, decoupling, elimination of client waterfalls, performance, and malleability) have had a much more positive impact than we’d ever predicted.

Simply put, Relay is unique among GraphQL client libraries in how it allows an application to scale to more contributors while remaining malleable and performant.

These benefits stem from Relay’s pattern of using fragments to colocate data dependencies within the components that render the data. If a component needs data, it has to be passed via a special kind of prop. These props are opaque (the parent component only knows that it needs to pass a {ChildComponentName}Fragment without knowing what it contains), which limits inter-component coupling. The fragments also ensure that a component only reads fields that it explicitly asked for, decreasing coupling with the underlying data. This increases malleability, safety, and performance. The Relay Compiler in turn is able to aggregate fragments into a single query, which avoids both client waterfalls and requesting the same data multiple times.

That’s all pretty abstract, so consider a simple React component that fetches data from a REST API and renders a list (this is similar to what you’d build using React Query, SWR, or even Apollo):
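
A framework-free sketch of that pattern, not the authors' original listing: components are plain functions that return strings, and a synchronous stub stands in for the REST API so the request order is easy to see. The component names follow the article; the endpoints, fields, and data are hypothetical.

```typescript
interface Asset { id: string; name: string; symbol: string; }
interface PriceAndBalance { price: number; balance: number; }

const requestLog: string[] = [];

// Fake REST backend. A real client would be asynchronous; a synchronous
// stub keeps the ordering of requests visible.
function fetchJson(path: string): unknown {
  requestLog.push(path);
  if (path === "/assets") {
    return [
      { id: "btc", name: "Bitcoin", symbol: "BTC" },
      { id: "eth", name: "Ethereum", symbol: "ETH" },
    ];
  }
  return { price: 100, balance: 2 };
}

// Receives the whole Asset but uses a single field.
function AssetHeader(props: { asset: Asset }): string {
  return props.asset.name;
}

// Hidden data dependency: rendering this component triggers its own request.
function AssetPriceAndBalance(props: { assetId: string }): string {
  const data = fetchJson(`/assets/${props.assetId}/price`) as PriceAndBalance;
  return `${data.price} x ${data.balance}`;
}

function AssetListItem(props: { asset: Asset }): string {
  const header = AssetHeader({ asset: props.asset });
  const details = AssetPriceAndBalance({ assetId: props.asset.id });
  return `${header}: ${details}`;
}

// Hidden data dependency: nothing in the signature says this fetches.
function AssetList(): string[] {
  const assets = fetchJson("/assets") as Asset[];
  return assets.map((asset) => AssetListItem({ asset }));
}

const renderedRows = AssetList();
// requestLog now holds the list request followed by one price request per
// row; no price request could start before the list request had returned.
```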

A few observations:

  1. The AssetList component is going to cause a network request to occur, but this is opaque to the component that renders it. This makes it nearly impossible to pre-load this data using static analysis.
  2. Likewise, AssetPriceAndBalance causes another network call, but will also cause a waterfall, as the request won’t be started until the parent components have finished fetching their data and rendering the list items. (The React team covers this in its discussion of “fetch-on-render”.)
  3. AssetList and AssetListItem are tightly coupled — the AssetList must provide an asset object that contains all the fields required by the subtree. Also, AssetHeader requires an entire Asset to be passed in, while only using a single field.
  4. Any time any data for a single asset changes, the entire list will be re-rendered.

While this is a trivial example, one can imagine how a few dozen components like this on a screen might interact to create a large number of data-fetching waterfalls as components load. Some approaches try to solve this by moving all of the data fetching calls to the top of the component tree (e.g., associate them with the route). However, this process is manual and error-prone, with the data dependencies being duplicated and likely to get out of sync. It also doesn’t solve the coupling and performance issues.

Relay solves these types of issues by design.

Let’s look at the same thing written with Relay:
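
A toy model of the fragment approach, not the Relay API and not the authors' original listing. Real Relay uses graphql tagged templates, hooks such as useFragment, and a compiler; here a "fragment" is just a list of field names plus the fragments it spreads, and masking is done by a helper rather than enforced by types. The fragment contents and data are hypothetical.

```typescript
interface Fragment { name: string; fields: string[]; spreads: Fragment[]; }
type AssetRecord = { [field: string]: string | number };

// Each component declares only the fields it renders.
const AssetHeaderFragment: Fragment = {
  name: "AssetHeaderFragment", fields: ["name"], spreads: [],
};
const AssetPriceAndBalanceFragment: Fragment = {
  name: "AssetPriceAndBalanceFragment", fields: ["price", "balance"], spreads: [],
};
const AssetListItemFragment: Fragment = {
  name: "AssetListItemFragment",
  fields: ["id"],
  spreads: [AssetHeaderFragment, AssetPriceAndBalanceFragment],
};

// Stand-in for the compiler step: flatten a fragment tree into one
// de-duplicated field list, so the page needs a single request.
function collectFields(fragment: Fragment, seen: Set<string> = new Set()): string[] {
  for (const field of fragment.fields) seen.add(field);
  for (const child of fragment.spreads) collectFields(child, seen);
  return Array.from(seen);
}

// Stand-in for data masking: a component sees only the fields that its own
// fragment declares.
function readFragment(fragment: Fragment, record: AssetRecord): AssetRecord {
  const visible: AssetRecord = {};
  for (const field of fragment.fields) visible[field] = record[field];
  return visible;
}

const requestLog: string[] = [];
function fetchAssets(fields: string[]): AssetRecord[] {
  requestLog.push(`assets { ${fields.join(" ")} }`);
  return [
    { id: "btc", name: "Bitcoin", price: 100, balance: 2 },
    { id: "eth", name: "Ethereum", price: 10, balance: 5 },
  ];
}

function AssetHeader(props: { asset: AssetRecord }): string {
  return String(readFragment(AssetHeaderFragment, props.asset).name);
}
function AssetPriceAndBalance(props: { asset: AssetRecord }): string {
  const data = readFragment(AssetPriceAndBalanceFragment, props.asset);
  return `${data.price} x ${data.balance}`;
}
// Passes the reference through without reading the children's fields.
function AssetListItem(props: { asset: AssetRecord }): string {
  return `${AssetHeader(props)}: ${AssetPriceAndBalance(props)}`;
}
function AssetList(props: { assets: AssetRecord[] }): string[] {
  return props.assets.map((asset) => AssetListItem({ asset }));
}

// All data requirements are known before rendering starts.
const queryFields = collectFields(AssetListItemFragment);
const rows = AssetList({ assets: fetchAssets(queryFields) });
```

In this model the whole list is served by one logged request, and removing a field from a fragment removes it from the query without touching any other component.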

How do our prior observations fare?

  1. AssetList no longer has hidden data dependencies: it clearly exposes the fact that it requires data via its props.
  2. Because the component is transparent about its need for data, all of the data requirements for a page can be grouped together and requested before rendering is ever started. This eliminates client waterfalls without engineers ever having to think about them.
  3. While requiring the data to be passed through the tree as props, Relay allows this to be done in a way that does not create additional coupling (because the fields are only accessible by the child component). The AssetList knows that it needs to pass the AssetListItem an AssetListItemFragmentRef, without knowing what that contains. (Compare this to route-based data loading, where data requirements are duplicated on the components and the route, and must be kept in sync.)
  4. This makes our code more malleable and easy to evolve — a list item can be changed in isolation without touching any other part of the application. If it needs new fields, it adds them to its fragment. When it stops needing a field, it removes it without having to be concerned that it will break another part of the app. All of this is enforced via type checking and lint rules. This also solves the API evolution problem mentioned at the beginning of this post: clients stop requesting data when it is no longer used, and eventually the fields can be removed from the schema.
  5. Because the data dependencies are locally declared, React and Relay are able to optimize rendering: if the price for an asset changes, ONLY the components that actually show that price will need to be re-rendered.
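
The last point can be illustrated with a small field-level subscription store. This is a sketch of the general idea under simplifying assumptions (flat string keys, numeric values), not a description of how Relay's store is implemented; the class and key names are hypothetical.

```typescript
type Listener = () => void;

class FieldStore {
  private values = new Map<string, number>();
  private listeners = new Map<string, Set<Listener>>();

  // Register interest in one field; returns an unsubscribe function.
  subscribe(key: string, listener: Listener): () => void {
    const set = this.listeners.get(key) ?? new Set<Listener>();
    set.add(listener);
    this.listeners.set(key, set);
    return () => { set.delete(listener); };
  }

  // Notify only the listeners of the field that actually changed.
  write(key: string, value: number): void {
    if (this.values.get(key) === value) return; // unchanged: nobody re-renders
    this.values.set(key, value);
    this.listeners.get(key)?.forEach((listener) => listener());
  }
}

const store = new FieldStore();
const renderCounts = { btcPrice: 0, ethPrice: 0, btcBalance: 0 };

// Each "component" subscribes only to the field it displays.
store.subscribe("btc.price", () => { renderCounts.btcPrice += 1; });
store.subscribe("eth.price", () => { renderCounts.ethPrice += 1; });
store.subscribe("btc.balance", () => { renderCounts.btcBalance += 1; });

// A price tick for BTC touches one subscriber and leaves the others alone.
store.write("btc.price", 101);
```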

While on a trivial application these benefits might not be a huge deal, it is difficult to overstate their impact on a large codebase with hundreds of weekly contributors. Perhaps it is best captured by this phrase from the recent React Conf Relay talk: Relay lets you “think locally, and optimize globally.”

Where do we go from here?

Migrating our applications to GraphQL and Relay is just the beginning. We have a lot more work to do to continue to flesh out GraphQL at Coinbase. Here are a few things on the roadmap:

Incremental delivery

Coinbase’s GraphQL API depends on many upstream services — some of which are slower than others. By default, GraphQL won’t send its response until all of the data is ready, meaning a query will be as slow as the slowest upstream service. This can be detrimental to application performance: a low-priority UI element that has a slow backend can degrade the performance of an entire page.

To solve this, the GraphQL community has been standardizing on a new directive called @defer. This allows sections of a query to be marked as “low priority”. The GraphQL server will send down the first chunk as soon as all of the required data is ready, and will stream the deferred parts down as they are available.
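
A sketch of how a deferred section might be written and how a client could fold a later chunk into the data it already has. The query fields are hypothetical, and the patch shape (a path plus data) is a simplification for illustration rather than the wire format, which was still being standardized.

```typescript
// Hypothetical query: the slow, low-priority part sits in a deferred
// inline fragment.
const dashboardQuery = `
  query Dashboard {
    viewer {
      portfolioBalance
      ... @defer(label: "rewards") {
        rewardsSummary { totalEarned }
      }
    }
  }
`;

type Json = { [key: string]: unknown };

// Simplified deferred chunk: where the data belongs, and the data itself.
interface DeferredPatch { path: string[]; data: Json; }

// Merge a deferred chunk into the data already on screen without mutating
// it. Handles object keys only; list indices are out of scope for this sketch.
function applyPatch(current: Json, patch: DeferredPatch): Json {
  if (patch.path.length === 0) return { ...current, ...patch.data };
  const [head, ...rest] = patch.path;
  const child = (current[head] ?? {}) as Json;
  return { ...current, [head]: applyPatch(child, { path: rest, data: patch.data }) };
}

// First chunk: everything except the deferred section.
const initialData: Json = { viewer: { portfolioBalance: "1234.56" } };

// Later chunk: the deferred section, once the slow service has answered.
const rewardsPatch: DeferredPatch = {
  path: ["viewer"],
  data: { rewardsSummary: { totalEarned: "7.89" } },
};

const completeData = applyPatch(initialData, rewardsPatch);
```

The page can render from initialData immediately and re-render the rewards element when the patch arrives.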

Live queries

Coinbase applications tend to have a lot of rapidly changing data (e.g. crypto prices and balances). Traditionally, we’ve used things like Pusher or other proprietary solutions to keep data up-to-date. With GraphQL, we can use Subscriptions for delivering live updates. However, we feel that Subscriptions are not an ideal tool for our needs, and plan to explore the use of Live Queries (more on this in a blog post down the road).

Edge caching

Coinbase is dedicated to increasing global economic freedom. To this end, we are working to make our products performant no matter where you live, including areas with slow data connections. To help make this a reality, we’d like to build and deploy a global, secure, reliable, and consistent edge caching layer to decrease total roundtrip time for all queries.

Collaboration with Relay

The Relay team has done a wonderful job and we’re incredibly grateful for the extra work they’ve done to let the world take advantage of their learnings at Meta. Going forward, we would like to turn this one-way relationship into a two-way relationship. Starting in Q2, Coinbase will be lending resources to help work on Relay OSS. We’re very excited to help push Relay forward!

Are you interested in solving big problems at an ever-growing scale? Come join us!


Rearchitecting apps for scale was originally published in The Coinbase Blog on Medium.


