Move fast, don't break your API

As a sequel to my talk last year on building Stripe's API, I thought it'd be useful to go over how we scaled some of our internal abstractions to continue building and iterating quickly. I gave this talk at APIStrat Chicago a couple of weeks ago and several events at HeavyBit last week, who generously recorded and transcribed the whole thing for anyone who'd like to watch a video. (Thanks, HeavyBit!)

I'm the kind of person who will impatiently watch videos of talks or lectures on 2x speed and greatly prefer reading a blog post that can be skimmed easily, so I decided to write an accompanying post to go with the slides.

Enjoy! (:

Let's build an API!

Stripe has a lot of APIs, and as a result we had to quickly figure out how to scale our abstractions and code. Since code is worth a thousand words (that's a saying, right?), I'll run through an example of building an API.

Here's a super simple example of an API endpoint in Sinatra that creates credit card charges:

post '/charges' do
  # Authentication
  api_key = get_api_key
  user = User.find_by_key(api_key)

  unless user
    return {error: "Invalid API key."}
  end

  # Collect user parameters
  card_number = params[:card_number]
  amount = params[:amount].to_i

  # Validations
  unless card_number.to_s.length == 16
    return {error: "Invalid card number."}
  end

  unless amount > 0 and amount <= CHARGE_MAX
    return {error: "Invalid amount."}
  end

  # Actually create the charge
  charge = create_charge(card_number, amount)

  # Return an API response
  # Parentheses are needed here: `json { ... }` would parse as a block
  json(
    id: charge.id,
    amount: charge.amount,
    card_number: charge.redacted_card_number,
    success: charge.success
  )
end

It's starting to get a little crowded in here, but it's more or less a working API. That was easy!™

What next?

As you start adding new endpoints, functionality, and changes you run into more problems.

How do you (scalably) handle authentication, validation, actual API logic, and error handling, and at the same time support every combination of behaviors that has ever existed in the past? We can't ever break integrations. Particularly for a payments processor, broken integrations mean our users are literally losing money every minute. We need to be able to build and change things rapidly without compromising any API stability or backwards compatibility.

Once your API has reached a certain size, something that also starts to creep up on you is dependencies. Documentation is a good example of this. Say you build an API and write up docs in the form of static HTML or markdown. It launches and everyone is happy.

A week later you decide to add something to the API. You diligently add it to your API code and maybe even remember to make the change in the docs as well.

With more additions or updates though, sooner or later this is going to happen:

Crap, I forgot to update the docs! — Everyone, ever

Sound familiar? (This has definitely happened to me more than once.) How did we start making these problems less painful?

Separate the layers of responsibility

In the example above, many things are going on at once: authentication, validation, endpoint-specific logic, error handling, and constructing the response.

Separation of responsibilities is CS 101, so we started moving things out and building abstractions around them.

For authentication and error handling, we use Rack middleware and it works pretty well. You shouldn't have to worry about authenticating users in the middle of your API logic (most frameworks have a concept of before filters for this as well). You also shouldn't have to hardcode error response formats. Wouldn't it be nice if you could just throw an error and know that someone else will catch it to format it into the proper response later?

use ErrorHandler
use Authenticator

get '/charges/:id' do
  user = env.user
  id = params[:id]

  if charge = user.get_charge(id)
    json(
      id: charge.id,
      ...
    )
  else
    # ErrorHandler catches this and formats the error response
    raise UserError.new("No charge #{id}!")
  end
end
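To make the middleware idea concrete, here's a minimal sketch of what ErrorHandler and Authenticator could look like. This is not our actual code: the USERS lookup table, the 'api.user' env key, and the 400 status are assumptions for illustration, and it only relies on the Rack convention that middleware responds to call(env) and returns [status, headers, body].

```ruby
require 'json'

# Hypothetical error type; the endpoint above only shows it being raised.
class UserError < StandardError; end

# Catches UserError from anything further down the stack and turns it
# into a JSON error response.
class ErrorHandler
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  rescue UserError => e
    body = JSON.generate(error: e.message)
    [400, { 'content-type' => 'application/json' }, [body]]
  end
end

# Looks up the user once, before any endpoint code runs.
class Authenticator
  USERS = { 'sk_test_123' => 'user_1' }.freeze # stand-in for User.find_by_key

  def initialize(app)
    @app = app
  end

  def call(env)
    user = USERS[env['HTTP_AUTHORIZATION']]
    raise UserError, 'Invalid API key.' unless user

    env['api.user'] = user # assumed key; endpoints read the user from here
    @app.call(env)
  end
end

# A trivial endpoint standing in for the Sinatra app.
endpoint = ->(env) { [200, {}, ["hello #{env['api.user']}"]] }
stack = ErrorHandler.new(Authenticator.new(endpoint))

ok = stack.call({ 'HTTP_AUTHORIZATION' => 'sk_test_123' })
denied = stack.call({})
```

Because ErrorHandler wraps everything else, an error raised in the authenticator or deep inside an endpoint comes out in the same response format.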

We represent endpoint validations, logic, and response generation internally in our code as APIMethods and APIResources. If you're familiar with MVC, they're very similar to controllers and views for your API.

class ChargeCreateMethod < AbstractAPIMethod
  required :amount, :integer
  required :card_number, :string

  resource ChargeAPIResource

  def execute
    # Same argument order as the first example
    create_charge(card_number, amount)
  end
end

class ChargeAPIResource < AbstractAPIResource
  required :id, :string
  required :amount, :integer
  required :card_number, :string
  required :success, :boolean

  def describe_card_number
    charge.redacted_card_number
  end
end
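If you're wondering how a declaration like required could work, here's a rough sketch of a base class that records the declared parameters and validates them before execute runs. The invoke entry point, the coercion rules, and the error messages are all assumptions of mine, not what our real AbstractAPIMethod does.

```ruby
# Hypothetical error type, mirroring the UserError used in the endpoints.
class UserError < StandardError; end

class AbstractAPIMethod
  # Each subclass keeps its own table of declared parameters.
  def self.params_spec
    @params_spec ||= {}
  end

  # Class-level DSL: `required :amount, :integer`
  def self.required(name, type)
    params_spec[name] = type
    attr_reader name
  end

  # Validate and coerce the raw params, then run the endpoint logic.
  def self.invoke(raw_params)
    new(raw_params).execute
  end

  def initialize(raw_params)
    self.class.params_spec.each do |name, type|
      value = raw_params[name]
      raise UserError, "Missing required param: #{name}" if value.nil?

      instance_variable_set("@#{name}", coerce(name, value, type))
    end
  end

  private

  def coerce(name, value, type)
    case type
    when :integer then Integer(value.to_s, 10)
    when :string then String(value)
    else value
    end
  rescue ArgumentError, TypeError
    raise UserError, "Invalid #{type} for param: #{name}"
  end
end

class ChargeCreateMethod < AbstractAPIMethod
  required :amount, :integer
  required :card_number, :string

  def execute
    # Stand-in for create_charge; just echoes the validated values.
    { amount: amount, card_number: card_number }
  end
end

result = ChargeCreateMethod.invoke({ amount: '500', card_number: '4242424242424242' })
```

By the time execute runs, amount is already an integer and card_number is already a string, so the endpoint logic never has to repeat those checks.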

Make it really hard to mess up

A good UX design principle is that you should make it really hard for your users to mess up or do the wrong thing. Why not apply this toward building the API as well?

One thing that we did that I thought was really cool was a system for documenting our API. To try to address "I forgot to update the docs!" syndrome, we made it really hard to forget by putting the documentation right under the code that adds a new property.

class ChargeCreateMethod < AbstractAPIMethod
  required :amount, :integer
  required :card_number, :string

  document :amount, "Amount, in cents."
  document :card_number, "The card number."

  ...
end

Our documentation then auto-generates itself from these specs—for changing most things, there's no need to go dig up static HTML files.
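As a sketch of how docs could fall out of those declarations (assuming a much simpler setup than our real one), the base class below stores each description next to its parameter spec and renders a plain-text reference from it. The render_docs name and the output format are made up for illustration.

```ruby
# A stripped-down stand-in for AbstractAPIMethod.
class AbstractAPIMethod
  def self.specs
    @specs ||= {}
  end

  def self.required(name, type)
    specs[name] = { type: type, doc: nil }
  end

  # Attach a description to an already-declared parameter.
  def self.document(name, text)
    spec = specs.fetch(name) { raise ArgumentError, "Unknown param #{name}" }
    spec[:doc] = text
  end

  # Render reference docs straight from the declared specs, and refuse
  # to render if any parameter was left undocumented.
  def self.render_docs
    undocumented = specs.select { |_, spec| spec[:doc].nil? }.keys
    raise "Undocumented params: #{undocumented.join(', ')}" unless undocumented.empty?

    specs.map { |name, spec| "#{name} (#{spec[:type]}): #{spec[:doc]}" }.join("\n")
  end
end

class ChargeCreateMethod < AbstractAPIMethod
  required :amount, :integer
  required :card_number, :string

  document :amount, "Amount, in cents."
  document :card_number, "The card number."
end

puts ChargeCreateMethod.render_docs
# amount (integer): Amount, in cents.
# card_number (string): The card number.
```

Failing loudly on an undocumented parameter is one way to turn "I forgot to update the docs!" into something a test or build step can catch.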

Similarly for our API libraries (or at least those that can support it), we don't hardcode properties for each object but instead dynamically generate them based on the properties present in the response that is received. This way we don't have to worry about adding new fields to object definitions in each library.
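In Ruby, the dynamic approach can be sketched with method_missing. The APIObject class below is hypothetical and far simpler than a real client library, but it shows the idea: the accessors are whatever keys the response happens to contain.

```ruby
require 'json'

# Hypothetical client-side object: no hardcoded list of properties.
class APIObject
  def initialize(attributes)
    @attributes = attributes
  end

  def self.from_json(body)
    new(JSON.parse(body))
  end

  # Answer any zero-argument call whose name matches a response key.
  def method_missing(name, *args)
    key = name.to_s
    return @attributes[key] if args.empty? && @attributes.key?(key)

    super
  end

  def respond_to_missing?(name, include_private = false)
    @attributes.key?(name.to_s) || super
  end
end

charge = APIObject.from_json('{"id": "ch_123", "amount": 500, "refunded": false}')
charge.id       # => "ch_123"
charge.refunded # => false
```

If the API later adds a field to the response, it shows up on the object without a library release. One caveat with this sketch: keys that collide with existing Ruby methods (class, hash, and so on) would need special handling.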

Hide your backwards compatibility

We're often asked how we implement our API backwards compatibility. This probably merits an entire talk and post by itself, but I'll go over it at a high level.

When a user starts implementing Stripe for the first time they don't need to worry about API versions. Instead it's invisible—they'll innocently make their first API request, we'll record what internal version they're on, and from then on our code takes care of making sure we never break their integration.

If a user wants to worry about versions they can: we allow overrides to be sent in via request headers and users can upgrade their version via the dashboard. However most people won't care, so they shouldn't have to know about it. All most Stripe users see is that their integration, even if they first wrote it years ago, never breaks.
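Here's a small sketch of that resolution logic. Everything in it is an assumption for illustration: the header name, the User struct, and the idea that the pinned version lives on the user record.

```ruby
# Hypothetical names throughout; this only illustrates the decision order.
CURRENT_VERSION = '2014-09-24'
VERSION_HEADER = 'HTTP_STRIPE_VERSION' # assumed header name, in Rack env form

User = Struct.new(:api_version)

# Pick the version to serve this request with.
def resolve_version(user, env)
  # An explicit per-request override wins, but is not saved.
  override = env[VERSION_HEADER]
  return override if override

  # First request ever: pin the user to whatever is current right now.
  user.api_version ||= CURRENT_VERSION
end

new_user = User.new(nil)
resolve_version(new_user, {}) # pins new_user to CURRENT_VERSION

old_user = User.new('2013-01-01')
resolve_version(old_user, {})                               # stays on the old version
resolve_version(old_user, { VERSION_HEADER => '2014-09-24' }) # one-off override
```

The override deliberately doesn't touch the stored version, so a user can try a newer version on a single request without changing how the rest of their integration behaves.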

All of our versions live in the same code base and are deployed to the same service. We don't do separate services or deploys for different versions. The downside of doing this is that it's easy for things to get hairy after a while.

First pass

Imagine sometime down the road Stripe decides to deprecate the amount parameter and all charges are now $1. (Disclaimer: I'm pretty sure we will never actually do this.)

What's the naive way to implement that change in behavior? Maybe something like this:

def execute
  if !user.old_version? && params[:amount]
    raise UserError.new("Invalid param.")
  end

  ...

  if !user.old_version?
    response.delete(:amount)
  end
end

There are a couple of problems with this. First, who knows what old_version? is supposed to mean? You can infer it from the code, but it's really not intuitive and is just waiting for accidental regressions. Second, the regular API logic and legacy logic are being mixed together. If someone wanted to add or update something in the current API (somewhere in that ...), they would have to wade through those extra conditionals.

Gates

We've modeled our versioning system around a series of gates. A "gate" is a database flag (similar to feature flags) that our code can use to determine what functionality to allow. For example, if a user is on an old version and is therefore allowed to send the amount parameter, they're on the allows_amount gate.

We declare all of the versions along with the corresponding functionality gates in a single YAML file:

-
  :version: 2014-09-24
  :new_gates:
    -
      :gate: allows_amount
      :description: >-
        Sending amount is now deprecated.

and are able to decouple versions from the actual behavior they represent in our code:

def execute
  if !user.gating(:allows_amount) && params[:amount]
    raise UserError.new("Invalid param.")
  end

  ...

  if !user.gating(:allows_amount)
    response.delete(:amount)
  end
end
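One way to picture how gating could be answered from the version list: a user keeps every gate introduced by a version newer than the one they're pinned to. The sketch below assumes exactly that rule. The VERSIONS constant mirrors the YAML as it might look once loaded, and the second entry, its gate name, and the Gates module are invented for the example.

```ruby
# Mirrors the YAML file after loading. The second entry is made up.
VERSIONS = [
  { version: '2014-09-24', new_gates: [{ gate: :allows_amount }] },
  { version: '2014-03-13', new_gates: [{ gate: :allows_legacy_errors }] }
].freeze

module Gates
  # Gates added by versions released after the user's pinned version.
  # ISO-formatted dates compare correctly as plain strings.
  def self.enabled_for(user_version)
    VERSIONS
      .select { |entry| entry[:version] > user_version }
      .flat_map { |entry| entry[:new_gates].map { |g| g[:gate] } }
  end
end

User = Struct.new(:api_version) do
  def gating(gate)
    Gates.enabled_for(api_version).include?(gate)
  end
end

User.new('2014-01-01').gating(:allows_amount) # => true
User.new('2014-09-24').gating(:allows_amount) # => false
```

The endpoint code only ever asks about a named behavior, so nobody has to remember which dated version changed what.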

To go one step further and actually get that logic out of the endpoint execution code, separate compatibility layers were added to our code flow. Because who doesn't need more layers of indirection?

Now when a request comes in, it first filters through the request compatibility layer. That layer may reject it based on the parameters (for example, if someone passes in amount who isn't allowed to) or munge the parameters into something that the later logic expects.

After the response is created, it passes through another layer of response compatibility which transforms the response into whatever the user's version dictates. The great thing about this is that the API logic and response construction steps can represent the current version of the API without any legacy edge case clutter.
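The flow through the two layers can be sketched roughly like this, using the hypothetical "all charges are $1" change. The module names, the handle function, and the exact munging rules are assumptions made for the example, not a description of our real layers.

```ruby
class UserError < StandardError; end

# Hypothetical user carrying a fixed set of gates.
User = Struct.new(:gates) do
  def gating(gate)
    gates.include?(gate)
  end
end

# Request side: reject or rewrite params before the endpoint sees them.
module RequestCompatibility
  def self.apply(user, params)
    return params unless params.key?(:amount)
    raise UserError, 'Invalid param.' unless user.gating(:allows_amount)

    # Legacy callers may still send amount; drop it so the endpoint only
    # ever sees current-version params.
    params.reject { |key, _| key == :amount }
  end
end

# Response side: reshape the current response for the user's version.
module ResponseCompatibility
  def self.apply(user, response)
    return response unless user.gating(:allows_amount)

    # Legacy callers expect an amount field; every charge is 100 cents.
    response.merge(amount: 100)
  end
end

# The endpoint itself only knows about the current API.
def create_charge_endpoint(params)
  { id: 'ch_123', card_number: params[:card_number] }
end

def handle(user, params)
  params = RequestCompatibility.apply(user, params)
  response = create_charge_endpoint(params)
  ResponseCompatibility.apply(user, response)
end

legacy_user = User.new([:allows_amount])
current_user = User.new([])

handle(legacy_user, { amount: 500, card_number: '4242' })
handle(current_user, { card_number: '4242' })
```

Notice that create_charge_endpoint contains no version checks at all; every legacy concern lives in one of the two compatibility modules.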

It's worth noting that these layers aren't free: there's a certain amount of complexity added when fiddling with requests and responses that are passed through, but we made the tradeoff to be able to keep our general API code (what engineers spend 80-90% of their time working on) clean and easier to reason about.

In the real world

What does this look like in practice? Stripe has (at the time of writing) 106 endpoints, 65 versions, and 6 API client libraries. You can do the combination math yourself.

We would be far behind where we are today if we couldn't find, read, and change code quickly without being afraid of breaking our users.

Conclusion

So design for yourself. Spend some effort not just thinking about how you can optimize your users' experiences, but how you can optimize your own as well.

Stripe's far from perfect; we're learning more and more every day and (like any other startup) there are still plenty of things in our code that annoy and embarrass us. I hope this gives others a sense of things we've learned over the years and helps other developer companies who are tackling the same problems.

If you're ever interested in chatting about any of these topics, drop me an email or poke me on Twitter.

I ended up writing far more than I had intended so if you made it down this far, congratulations and thanks for reading!

Credit to Sheena Pakanati, Saikat Chakrabarti, Ross Boucher, Greg Brockman, and many others at Stripe for contemplating, building, and iterating on everything covered in this post.