Kapost Engineering

Implementing an API Gateway with GraphQL: Resolution Strategies

Nathanael Beisiegel

Update: I have published a follow up post on practical implementation advice with an open-sourced GraphQL API Gateway example. Check it out here!

An API Gateway is a microservice pattern where a separate service is built to sit in front of your other back-end services. This service acts as a back-end for front-ends, where it can proxy and unify access to a variety of back-end services. This pattern can allow for varying use-cases, but typically this is done to unify client network access, centralize authentication, and combine data from multiple services’ data stores.

We started investigating this architecture after reaching a breaking point in our JavaScript clients. As we made efforts to move from a monolith to microservices, more and more complexity was pushed down to the client. Data now stored across different back-end services required clients to make requests across separate APIs and then join the results into combined data structures for rendering. Often this was not trivial, due to a lack of consistent response shapes and new cases to handle eventual consistency and referential integrity issues.

Furthermore, an internal push to make endpoints more normalized (to improve back-end cacheability) resulted in more follow-up requests in the client. These new endpoints improved browser caching at the cost of increasing our “network waterfalls”, where the client must make follow-up requests to denormalize data from an original request. Worse still, developers would often be tempted to skip this network waterfall complexity by naïvely fetching unpaginated collections of models to join in later. Obviously, this is an approach that does not scale to large collections and slows down the page with expensive and often wasteful requests.

Ultimately, we realized we needed to take a step back and consider a better approach to improve network performance, trim our client bundle of sizable network code, and reduce the complexity required to implement front-end features. We decided to experiment with an API Gateway over GraphQL. While API Gateways can, of course, be served over REST, we decided to try GraphQL with the promise that it would provide strong typing and a solid mechanism for a generic API that could support new features with little-to-no client changes.

After a year of experience with this new architecture, we believe this was a good starting point to unify an existing set of REST requests, especially cacheable ones. Apollo has also recently recommended this approach in their Principled GraphQL recommendations. However, there can be performance limitations when proxying over REST, along with other concerns you should be aware of when considering this architecture, which we will address below.

Since many API Gateway articles across the web are written from a greenfield, GraphQL-only viewpoint, we thought it would be useful to share lessons learned from implementing this on top of a brownfield set of services. In this post we’ll discuss query resolution and attempt to remove any mystery about how you will actually proxy and combine this data. There are different strategies you can take depending on what your APIs can support.

Why might you want to do this? Or not?

If, like us, you find yourself needing to stop the bleeding from the complexity of many REST requests split across services, then an API Gateway can be a great start. In addition to solving the network problem I’ve just described, an API Gateway provides several organizational benefits as well.

Of course, this consolidation comes at a cost. Like other proxies, it presents another point of failure in your back-end (a point of failure logically speaking—note that the described API Gateway functionality can easily be scaled vertically and horizontally). It also presents an additional layer for your team to learn and maintain, which can turn into extra work when implementing new endpoints and growing this new API.

Additionally, implementing GraphQL over REST can present performance bottlenecks when resolving at the network layer if your APIs are slow. We should note that this bottleneck is the same one your clients face when joining data themselves, so moving to the API Gateway is often still the better option. Be aware that stacking serial joins over network requests can make the total request slower. If data split across services isn’t actually causing you pain, you are likely better off implementing a GraphQL endpoint in a primary service first; you are more likely to benefit from resolving at the ORM / database level than you would from any potential REST caching over an API Gateway.

Why GraphQL?

I don’t want to rehash the common discussion of pros/cons for GraphQL, but I will mention a few qualities that made GraphQL an obvious candidate for an API Gateway. In short, it’s a ready-made ecosystem and a perfect fit for this functionality. If you were to build an API Gateway over REST from scratch, you would have to establish new shapes and conventions for a new interface. GraphQL provides the tools to quickly build these with typing, mocking, documentation, and playground functionality for free. Apollo Server and graphql-tools often make resolving this data as simple as implementing a single function to fetch and optionally reshape the responses. We’ve found these tools to provide an excellent developer experience, especially with the huge improvement in API discoverability and documentation from GraphQL Playground out of the box. We also appreciate how we can benefit from and contribute to the client tools and front-end community, which are actively growing and improving. GraphQL also supports growing our usage via schema stitching if another service opts to provide a GraphQL endpoint.

Of course, if your data is truly cacheable over REST (i.e. public and unchanging from user to user), GraphQL may not be the right tool for your use case. We recommend thinking through the “cacheability spectrum” diagram below, especially for follow-up requests.

Resolution

I’m going to go into the weeds of GraphQL resolution now. Stay with me: it’s difficult to summarize, and it’s important to consider whether your APIs can support the different kinds of requests GraphQL may proxy. If you are unfamiliar with the GraphQL queries and schema below, I recommend reading the docs on GraphQL querying and schema definition before continuing.

GraphQL queries are resolved from the top of the query downward, field by field. Nested objects typically represent a join, although depending on how denormalized your endpoint is, a nested object may not require another request. If we consider lists and joins, this means our resolvers must deal with the following types of relations: 1:1, 1:N, M:1, and M:N.

1:1 queries are simple: either a root query, or a single follow-up query for a singular resource when nested.

query GetAPlaylist {
  playlist(id: "foo") {
    title
    creator {
      name
      avatarUrl
    }
  }
}

Or, in a resolution plan:

Get the playlist,
    Then, get the creator by playlist.creatorId

1:N queries are similar but fetch a list (usually paginated).

query GetPlaylistAndSongs {
  playlist(id: "foo") {
    songs(first: 20) {
      title
      length
      songLink
    }
  }
}
Get the playlist
    Then, get a page of 20 songs by playlist.id

M:1 queries require an identifier on the parent type, and will fetch a set of items by id.

query GetPlaylistsAndSongs {
  playlists(first: 20) {
    creator {
      name
      avatarUrl
    }
  }
}
Get the first 20 playlists
    Then, find all the creators by each playlist.creatorId (ideally batching into one request with multiple IDs)

M:N joins are the most complex: they require a page for each parent item.
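
For completeness, a query of this shape might look like the following (the field names are hypothetical, mirroring the earlier examples):

query GetPlaylistsWithSongs {
  playlists(first: 20) {
    title
    songs(first: 20) {
      title
      length
    }
  }
}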

Get the first 20 playlists
    Then, for each playlist, get a page of songs by playlist.id (ideally batching into one request for multiple pages)

M:N joins are the most expensive to resolve, assuming the child relationship is paged. You may not have any of them in your existing APIs. If resolving at the database level, you would likely express the join in SQL inside that resolver. Resolving this data over REST may require some specialized endpoints, which we will get to.

Resolving over multiple REST endpoints: a practical example

Hopefully you are still following along! I’d like to show a practical example of resolving requests over REST for each of the joins listed above.

We will use the following schema throughout the examples. It includes some simplified types for a music streaming service.

type Query {
  currentUser: UserProfile!
  playlists: [Playlist!]!
  friendActivity: [FriendListen!]!
}

type UserProfile {
  name: String!
  avatarUrl: URL!
}

type Playlist {
  name: String!
  createdBy: UserProfile!
  songs(page: Int, pageSize: Int = 25): [Song!]!
}

type FriendListen {
  friend: UserProfile!
  listenedTo: Song!
  fromPlaylist: Playlist
}

type Song {
  title: String!
  streamUrl: URL!
  artists: [Artist!]!
}

type Artist {
  name: String!
}

We could imagine these types supporting a “homepage” query like the following:

query HomeQuery {
  currentUser {
    name
    avatarUrl
  }
  playlists {
    name
    createdBy {
      name
      avatarUrl
    }
    songs(page: 1, pageSize: 20) {
      title
      streamUrl
      artists {
        name
      }
    }
  }
  friendActivity {
    friend {
      name
      avatarUrl
    }
    listenedTo {
      title
      streamUrl
    }
    fromPlaylist {
      name
    }
  }
}

For our back-end, we imagine two services. The first, primary service manages the core domain of users, playlists, and songs. The second service manages FriendActivity, opting to store this flood of data in a separate time-series database.

Currently both services provide very normalized, cacheable REST endpoints, where no information is denormalized on the User, Playlist, and Song endpoints.

Using apollo-server/graphql-tools resolver functions, our top-level resolution logic might look something like the following. We use an api instance (a fetch/axios wrapper that’s created per request) provided on the resolver context.

// Query.currentUser
async function currentUserResolver(_obj, _arg, context) {
  const currentUserProfile = await context.api.fetch("/service1/profile");

  // Transforming the shape to match the GraphQL schema when necessary
  return {
    name: currentUserProfile.response.userName,
    avatarUrl: currentUserProfile.response.avatar
  };
}

// Query.playlists
function playlistsResolver(_obj, _arg, context) {
  // The API response looks like the following:
  // [{ id: "a", name: "Mixtape the 2nd", createdBy: "234234acd" }, ...]
  // Note that this endpoint currently does not return songIds.
  return context.api.fetch("/service1/playlists");
}

// Query.friendActivity
function friendActivityResolver(_obj, _arg, context) {
  // The API response looks like the following:
  // [{ userId: "abc", ... }, ...]
  return context.api.fetch("/service2/friendActivity");
}

Root queries are easy, since there is no joining and we are just proxying requests. However, the second layer gets more challenging. Let’s look at FriendListen.friend to get the friends. We have a userId on each activity, so this is an M:1 join where we need one user for each FriendListen event. We use data-loader in a new model class to batch these requests together.

const DataLoader = require("dataloader");

// Model class that uses `data-loader` to debounce and
// combine requests into one call. This model is created
// per request and is available in `context`.
class User {
  constructor(api) {
    this.api = api;
    this.userLoader = new DataLoader(this.batchFetch);
  }

  get = (id) => {
    return this.userLoader.load(id);
  }

  batchFetch = (ids) => {
    // Must resolve to an array of users in the same order as `ids`.
    return this.api.get("/service1/users", { ids });
  }
}

// FriendListen.friend
function friendResolver(friendListen, _arg, context) {
  return context.models.user.get(friendListen.userId);
}

We can follow the same pattern for all M:1 requests, including FriendListen.listenedTo, FriendListen.fromPlaylist, and Playlist.createdBy. Even better, by sharing the same model across resolvers we avoid re-requesting userIds we have already asked for, as data-loader is smart enough to cache responses and deduplicate ids. Note that for all M:1 joins, we need an endpoint that can return results for multiple ids (which we have named a “multishow” endpoint).
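
For example, a Playlist.createdBy resolver following this pattern might look like the sketch below (the resolver name is mine). It reuses the same per-request User model, so ids already requested for FriendListen.friend are served from data-loader's cache:

// Playlist.createdBy
function playlistCreatedByResolver(playlist, _arg, context) {
  // `playlist.createdBy` holds the creator's user id in the playlists response above
  return context.models.user.get(playlist.createdBy);
}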

Finally, we are left with one more challenging query: Playlist.songs. This is our first M:N join, where we need to fetch 20 songs for each playlist item. The songs endpoint is paged, since a playlist could have many songs. We now require the ability to batch multiple paged index requests into a single endpoint (a “multiindex”). This is an endpoint you likely do not have today, but it allows our data-loader to batch index requests into one network request. We also provide a cacheKeyFn that serializes params objects into a stable cache key so that identical sets of params are deduplicated.

// Model class that uses `data-loader` to debounce and
// combine requests into one call. This model is created
// per request and is available in `context`.
class Songs {
  constructor(api) {
    this.api = api;
    this.songLoader = new DataLoader(this.batchMultishow);
    this.songPageLoader = new DataLoader(this.batchMultiindex, {
      cacheKeyFn: stableSerializeObject
    });
  }

  get = (id) => {
    return this.songLoader.load(id);
  }

  getPage = (params) => {
    return this.songPageLoader.load(params);
  }

  batchMultishow = (ids) => {
    return this.api.get("/service1/songs/multishow", { ids });
  }

  batchMultiindex = (paramsList) => {
    return this.api.get("/service1/songs/multiindex", paramsList);
  }
}

// Playlist.songs
function playlistSongsResolver(playlist, args, context) {
  return context.models.songs.getPage({
    page: args.page,
    pageSize: args.pageSize,
    playlistId: playlist.id
  });
}
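
The stableSerializeObject helper is referenced above but not shown; a minimal sketch (assuming flat params objects like the one passed to getPage) could sort keys before serializing, so equivalent params produce the same cache key:

function stableSerializeObject(params) {
  // Sort keys so that { page: 1, playlistId: "a" } and { playlistId: "a", page: 1 }
  // serialize to the same string, letting data-loader deduplicate them.
  return JSON.stringify(
    Object.keys(params)
      .sort()
      .reduce((sorted, key) => {
        sorted[key] = params[key];
        return sorted;
      }, {})
  );
}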

M:N joins are the most complex due to the paging. At this point you may be thinking that the multiindex endpoint puts a heavy expectation on the back-end. It’s possible you will need none or very few of these, but realize you will face this complexity as soon as you need a nested, paged query in your GraphQL schema.

Let’s look at our final resolver: Song.artists. You might think this is another M:N join due to the array, but we don’t need to page the artists endpoint because the number of artists per song is a fairly small constant (typically 1-3). Assuming the songs response includes artistIds for each song, we can rely on a “multishow” endpoint again.

// Model class that uses `data-loader` to debounce and
// combine requests into one call. This model is created
// per request and is available in `context`.
class Artist {
  constructor(api) {
    this.api = api;
    this.artistLoader = new DataLoader(this.batchFetch);
  }

  // using https://github.com/facebook/dataloader#loadmanykeys
  getMany = (ids) => {
    return this.artistLoader.loadMany(ids);
  }

  batchFetch = (ids) => {
    return this.api.get("/service1/artists/multishow", { ids });
  }
}

// Song.artists
function songArtistsResolver(song, _arg, context) {
  return context.models.artists.getMany(song.artistIds);
}
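
To tie these pieces together, the per-request api instance and models referenced throughout might be wired into the Apollo Server context roughly as follows. This is only a sketch: createApi is an assumed factory wrapping fetch/axios with the request's auth headers, and typeDefs/resolvers stand in for the schema and resolver map from this post.

const { ApolloServer } = require("apollo-server");

const server = new ApolloServer({
  typeDefs,
  resolvers,
  // Build a fresh api client and fresh model instances per request so that
  // data-loader caches never leak between users.
  context: ({ req }) => {
    const api = createApi(req);
    return {
      api,
      models: {
        user: new User(api),
        songs: new Songs(api),
        artists: new Artist(api)
      }
    };
  }
});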

Phew, that was a lot to take in, but let’s recap our example: root queries are simple proxies, M:1 joins batch ids into “multishow” endpoints with data-loader, and M:N joins batch paged requests into “multiindex” endpoints.

Now that we understand the scope of what generic resolution over REST looks like, let’s talk about optimization. M:N queries are the most difficult to resolve, and if you cannot resolve them efficiently over an endpoint, you will need to resolve these joins at the database level instead of over REST.

The Resolution Spectrum and Caching

So far we have assumed our APIs return flat, normalized structures. Your APIs may work differently and already join in lots of data. You may find this middle ground acceptable, mixing and matching joins between the API Gateway and your services, but you can lose cacheability.

We should also consider the opposite end of that spectrum, where we let the service do the complete join itself (within its own boundary). The most configurable version of such an endpoint would be another GraphQL endpoint! Of course, we lose the ability to cache over REST with this strategy.

Diagram: the cacheability spectrum. More joined and parameterized requests are less cacheable at the REST level; less joined, less parameterized requests are more cacheable.

Resolvers are essentially our own query plan. We can decide to join, or we can defer to a database to join within a service. You could have your API resolve all of the data, but shoving more into one endpoint gets you closer and closer to another GraphQL endpoint!

In general, I would recommend preferring the extremes for your endpoints: either pursue fast, cacheable, normalized endpoints and join them in your gateway, or provide GraphQL endpoints in your other services and schema stitch them together. Endpoints in the middle don’t buy you as much. Observant readers may have noticed that M:1 and M:N requests are less likely to be cacheable. We recommend reading over the Principled GraphQL point on this topic; often the resolution over HTTP is worth the reuse of existing APIs.

Recommendations

I hope this post was useful for thinking through what an API Gateway must fulfill, and that I haven’t scared you off the idea: M:N queries may be nice but ultimately unnecessary depending on your use case. We have a few recommendations to leave you with:

Network Performance Prerequisites

Your back-end requests must resolve in a reasonable amount of time. You really should have sub-300ms response times for each request made from the API Gateway, especially when nesting data. This can be quite challenging if you are performing M:N multiindex queries, but it can be done with specific endpoints and the right queries/database underneath. We were able to do this with a denormalized datastore.

Don’t nest too many network joins

We recommend against stacking too many nested layers of network joins when using GraphQL over an existing set of APIs. Serial requests add up, and you have a limited budget for your total request time. This can be tricky, but you can design a schema that avoids deep type resolution if you are careful. As a rule of thumb, multiply the average speed of your API responses (e.g. 200ms) by the nesting level (e.g. 4 nested queries) and budget below 1 second. Query cost and depth limiting can help you enforce this as well if you don’t want to restrict your types.
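
If you go the depth-limiting route, one option is a validation rule from the graphql-depth-limit package; a minimal sketch with Apollo Server (reusing the server setup sketched earlier) might look like:

const depthLimit = require("graphql-depth-limit");

const server = new ApolloServer({
  typeDefs,
  resolvers,
  // Reject queries nested more than 4 levels deep before any resolution starts.
  validationRules: [depthLimit(4)]
});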

Cache REST requests at the Gateway

We used a Redis store with some custom header checks for this, but you can also use an off-the-shelf solution such as Squid or Varnish in front of your back-end requests. We decided our own implementation was simple enough to justify avoiding the additional network hop.
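
Our exact implementation (including the header checks) is out of scope here, but a rough sketch of the read-through idea, assuming ioredis and node-fetch, might look like:

const Redis = require("ioredis");
const fetch = require("node-fetch");

const redis = new Redis();

// Read-through cache for GET requests made from the gateway,
// keyed by URL with a short TTL.
async function cachedGet(url, ttlSeconds = 60) {
  const cached = await redis.get(url);
  if (cached) return JSON.parse(cached);

  const response = await fetch(url);
  const body = await response.json();
  await redis.set(url, JSON.stringify(body), "EX", ttlSeconds);
  return body;
}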

If this architecture is intriguing to you, we recommend checking out our open-sourced GraphQL API Gateway project on GitHub, where we discuss some practical patterns and implementation notes for building performant and well-organized API resolvers. I’ve written a follow-up blog post on some practical implementation advice that I hope you’ll check out.

We are hiring! This project was open-sourced by Kapost, a content operations and marketing platform developed in beautiful Boulder, CO. If this project and similar challenges sound fun to you, check out our careers page and join our team!

Nathanael is a software engineer at Kapost. He is passionate about delivering better features and UX by improving front-end architecture, tooling, and education. Follow him on Twitter @NBeisiegel.
