A few years ago, we began a fun and challenging journey to break a large, monolithic codebase into a set of isolated, independent REST services. This effort has already yielded a ton of value in simplifying our codebase and speeding up development.
Along the way, we wrote this guide to building services in our ecosystem. We thought other folks embarking on this path might find it useful, so we’re sharing it here.
Use REST and JSON
REST is an approach to building web services that encourages speaking in terms of entities, and uses HTTP thoroughly to interact with data and carry out operations. See the HTTP spec for helpful insight into the details of REST.
Treat interface as fundamentally important
In REST, the interface is key. If we’re doing it right, we shouldn’t find ourselves wanting to hide service calls behind layers of abstraction on the client side. Rather, the REST interface is the programming interface. This means we should think hard about how we name endpoints, and name them for the nouns they represent. Responses are a crucial part of the interface, too. They should consist solely of data that represents the resource.
Build features as composable building blocks
Whenever possible, resources should anticipate being used by multiple sites in multiple contexts. It’s often useful to ask, “Would this resource make sense if we were building a t-shirt company?” It’s the caller who should make the functionality specific to the application.
Aim for as few resources as are needed
We should see each resource (in this context, “resource” means endpoint) as somewhat precious. For each one we have, we introduce overhead for maintaining the code, the tests, and the documentation. For example, a resource per photo attribute is too granular—instead, set attributes by POSTing to the resource that represents the photo as a whole.
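As a rough sketch of that idea, here is a single handler that accepts a partial set of attributes for the photo as a whole. The in-memory PHOTOS store, the WRITABLE set, the attribute names, and the /photos/<id> path are all assumptions made up for illustration, not our actual photo service.

```python
# Hypothetical in-memory store standing in for the photo service's data.
PHOTOS = {42: {"id": 42, "title": "Sunset", "keywords": ["sky"], "width": 1024}}

# Hypothetical set of attributes a caller may change; anything else is rejected.
WRITABLE = {"title", "keywords"}


def post_photo(photo_id, attributes):
    """Handle POST /photos/<id>: apply a partial update to the whole photo."""
    photo = PHOTOS.get(photo_id)
    if photo is None:
        return 404, {"error": f"No photo with id {photo_id}."}
    unknown = sorted(set(attributes) - WRITABLE)
    if unknown:
        return 400, {"error": f"Attributes not writable: {', '.join(unknown)}."}
    photo.update(attributes)
    # The response is just the data that represents the resource.
    return 200, photo
```

One resource covers every writable attribute, so there is one handler, one set of tests, and one page of documentation to maintain.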
Manage dependencies locally
Services should have localized, self-managed dependencies. In the case of Perl, this means specifying packages with Carton and cpanfile and managing with cpanm. Node.js, Ruby, and PHP all have similar systems which Rock and our build system support. There’s overhead that comes with this approach, but the upside is that we get to upgrade dependencies granularly. In practice, in large, tightly-coupled systems, dependencies are almost never upgraded due to the chance of breaking something.
Avoid services calling other services
If you find yourself wanting to call a service from within another service, take that as a sign to step back and evaluate where lines of separation are falling. Very often, the client can call one service and then call the next one, rather than the first directly calling the second. This way we don’t have call stacks multiple services deep, and we can test our resources in isolation.
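A minimal sketch of client-side composition might look like the following. The two fetch functions are stand-ins for HTTP calls to hypothetical user and lightbox services; the names and data shapes are invented for this example.

```python
def fetch_user(user_id):
    """Stand-in for a call to a hypothetical user service."""
    return {"id": user_id, "lightbox_ids": [7, 9]}


def fetch_lightboxes(lightbox_ids):
    """Stand-in for a call to a hypothetical lightbox service."""
    return [{"id": i, "name": f"Lightbox {i}"} for i in lightbox_ids]


def user_page_data(user_id, get_user=fetch_user, get_lightboxes=fetch_lightboxes):
    """Client-side composition: call one service, then the next."""
    user = get_user(user_id)
    lightboxes = get_lightboxes(user["lightbox_ids"])
    return {"user": user, "lightboxes": lightboxes}
```

Neither service knows about the other, so each can be tested on its own, and the client can swap in fakes for either call.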
Services own their own caching
Services should cache their own data as appropriate. Clients may cache responses as directed by the service’s response headers, but the service should not count on them doing so. A memcache pool specific to the service is often the way to go.
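To illustrate, here is a small sketch in which the service checks its own cache before loading data and also advertises a cache lifetime to clients. TTLCache is a tiny in-process stand-in for a memcache pool, and get_photo and load are hypothetical names; a real service would use a memcache client instead.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a service-specific memcache pool."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._items[key]
            return None
        return value

    def set(self, key, value):
        self._items[key] = (value, self.clock() + self.ttl)


def get_photo(photo_id, cache, load):
    """Return (headers, body), consulting the service's own cache first."""
    body = cache.get(photo_id)
    if body is None:
        body = load(photo_id)
        cache.set(photo_id, body)
    # Tell clients how long they may cache, without relying on them to do so.
    headers = {"Cache-Control": f"max-age={cache.ttl}"}
    return headers, body
```

The service stays fast even if no client honors the Cache-Control header, which is the point of owning the cache.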
Services own their own data
Ideally, services should own their own data. That means instead of storing their data in the main application database, they’d store data in their own local data store, whether it’s a set of MariaDB boxes, or some other data store like Redis.
Services own their own security
Services should take it upon themselves to authenticate and validate incoming requests. For actions taken on behalf of end users, that can mean integrating with OAuth. In other cases, it means managing an internal set of users, as our storage services do. Either way, services generally shouldn’t just trust that callers are authorized to do what they’re requesting. Ideally we’d like to be able to open services to the outside world someday. In practice, we use api.shutterstock.com as an additional line of defense for outside users.
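As one possible shape for the internal-users case, the sketch below checks an API key against a small user table and then checks whether that user may use the requested method. The USERS table, the key values, and the authorize function are hypothetical; this is not how our storage services are actually implemented, and the OAuth case is not shown.

```python
import hashlib
import hmac

# Hypothetical internal users: name -> (SHA-256 of API key, allowed methods).
USERS = {
    "site": (hashlib.sha256(b"site-key").hexdigest(), {"GET"}),
    "admin": (hashlib.sha256(b"admin-key").hexdigest(), {"GET", "POST", "DELETE"}),
}


def authorize(user, api_key, method):
    """Return 401 for an unknown user or bad key, 403 if not permitted, else 200."""
    record = USERS.get(user)
    if record is None:
        return 401
    key_hash, allowed = record
    supplied = hashlib.sha256(api_key.encode()).hexdigest()
    # Constant-time comparison avoids leaking how much of the hash matched.
    if not hmac.compare_digest(supplied, key_hash):
        return 401
    return 200 if method in allowed else 403
```

The service makes its own decision on every request rather than assuming that whoever reached it is allowed to.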
Each service lives in its own repository
Repositories for services follow the naming convention
Use middleware to share functionality across services
Middleware is packaged utility functionality that you may want across projects and that is likely to run on every request. For example, in a user-facing site this may include setting up a session, translating the page output, or assigning a visitor ID. In backend services this could include setting up logging or configuring a caching layer. Many modern web frameworks implement WSGI or an interface modeled on it (such as PSGI for Perl or Rack for Ruby), which gives projects a common interface for sharing middleware.
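For a concrete picture, here is a minimal sketch of visitor-ID middleware written against Python's WSGI interface. The X-Visitor-ID header, the "visitor_id" environ key, and hello_app are assumptions chosen for this example; the same pattern applies in PSGI or Rack with their own conventions.

```python
import uuid


class VisitorIdMiddleware:
    """WSGI middleware that makes sure every request carries a visitor ID."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # "HTTP_X_VISITOR_ID" is a hypothetical header chosen for this sketch.
        visitor_id = environ.get("HTTP_X_VISITOR_ID") or uuid.uuid4().hex
        environ["visitor_id"] = visitor_id

        def start_with_visitor(status, headers, exc_info=None):
            # Echo the ID back so the client can send it on later requests.
            headers = list(headers) + [("X-Visitor-ID", visitor_id)]
            return start_response(status, headers, exc_info)

        return self.app(environ, start_with_visitor)


def hello_app(environ, start_response):
    """A minimal WSGI application to wrap."""
    body = f"visitor {environ['visitor_id']}".encode()
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]


app = VisitorIdMiddleware(hello_app)
```

Because the middleware only depends on the WSGI calling convention, it can wrap any WSGI application without knowing which framework built it.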
Code and Branching
Discuss your changes with project stakeholders
Each repository has an associated Google Groups list. Before you make any significant changes, please get feedback there. That list goes to watchers of the repository. You are encouraged to link to a diff or pull request.
Develop and test on your local instance
To get started, clone the repository to some environment where Rock is installed. Then build and install dependencies with rock build. Then run your instance with rock run, and point callers to your local instance.
Add new features in branches and merge to master as late as possible
Push your feature branches up to origin to share with others, and use our build tools to deploy your branch to lower environments. Once you’re ready to push to production, merge into master and go for it. A merge to master should be treated as a deployment to production.
Write unit tests for all functionality
Before you add functionality, add a failing test that will succeed when your work is done. Aim for full code coverage. Before you commit, run all tests to make sure you didn’t break other tests. Fix any broken tests you find, even if you weren’t the one who broke them. Mock your test data rather than interacting with a test data store.
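Here is a small sketch of a unit test that mocks its data instead of touching a data store, using Python's standard unittest and unittest.mock modules. The approved_clip_count function and the store object with a clips() method are hypothetical examples, not part of any of our services.

```python
import unittest
from unittest import mock


def approved_clip_count(store):
    """Count approved clips; `store` is any object with a `clips()` method."""
    return sum(1 for clip in store.clips() if clip["status"] == "approved")


class ApprovedClipCountTest(unittest.TestCase):
    def test_counts_only_approved_clips(self):
        # Mocked data instead of a real test data store.
        store = mock.Mock()
        store.clips.return_value = [
            {"id": 1, "status": "approved"},
            {"id": 2, "status": "pending"},
            {"id": 3, "status": "approved"},
        ]
        self.assertEqual(approved_clip_count(store), 2)
        store.clips.assert_called_once_with()
```

Written first, the test fails until approved_clip_count exists and filters correctly. Saved in a file, it can be run with python -m unittest.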
Write ntf acceptance tests for every resource
Write acceptance tests for your resources with ntf. ntf continually executes requests against our services in production and tests that responses match what’s expected. Performance data is trended as well. Tests in ntf should be able to return in a matter of seconds (say, less than 10), and be okay to run thousands of times a day.
Make it easy to set up and monitor
Add useful messages at higher-verbosity log levels. Avoid lengthy start-up times (>30s). Don’t require pre-run setup scripts. Fail with descriptive error messages, not just representative HTTP status codes.
Aim to be faster than any previous implementation in production. Make the same call against both implementations and measure the difference in performance. This is also a chance to prove correctness: that the results you get from the service match the results powering production.
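One way to sketch that comparison is a helper that times both implementations on the same input and checks that their results agree. The compare function and its report keys are assumptions for illustration; old_call and new_call stand for whatever client functions reach the previous implementation and the new service.

```python
import time


def compare(old_call, new_call, args, runs=5):
    """Time both implementations on the same input and check the results agree."""

    def best_time(call):
        best = float("inf")
        result = None
        for _ in range(runs):
            start = time.perf_counter()
            result = call(*args)
            best = min(best, time.perf_counter() - start)
        return best, result

    old_seconds, old_result = best_time(old_call)
    new_seconds, new_result = best_time(new_call)
    return {
        "old_seconds": old_seconds,
        "new_seconds": new_seconds,
        "results_match": old_result == new_result,
    }
```

Taking the best of several runs reduces noise from one-off delays. A single pair of calls is rarely enough to trust a timing difference.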
Monitor after you deploy
Watch access logs and error logs. Also watch real-time request volume graphs on Ground Control, and watch the status of ntf tests.
Document every resource
For each resource, state what problem it solves from the perspective of the consumer. Document request parameters. Please use plain, straightforward language. Whenever possible, include complete copy-and-pasteable working examples of requests and responses.
Be descriptive, but concise
Don’t say more than you need to. For example, rather than “this parameter specifies the width of the image”, you can simply say “image width”. Challenge yourself to boil it down to the core meaning of the thing.
Provide a resources resource
By convention, services provide an introspective meta resource at /resources that lists resource endpoints along with supported request methods and documentation. Ground Control will consume /resources and present your docs to the humans. Test and see how things look there.
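As a sketch only, a /resources body could be generated from a registry of endpoints like the one below. The exact format Ground Control expects is not described here, so the JSON shape, the RESOURCES registry, and the example paths are all assumptions.

```python
import json

# Hypothetical registry: path -> {method: one-line documentation}.
RESOURCES = {
    "/photos/{id}": {"GET": "Photo metadata.", "POST": "Update photo attributes."},
    "/resources": {"GET": "List of resources, methods, and docs."},
}


def resources_response(registry=RESOURCES):
    """Build a JSON body for GET /resources from the registry."""
    listing = [
        {"path": path, "methods": sorted(methods), "docs": methods}
        for path, methods in sorted(registry.items())
    ]
    return json.dumps({"resources": listing}, indent=2)
```

Generating the listing from the same registry that drives routing keeps the docs from drifting away from what the service actually serves.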
Send descriptive error messages
Along with the appropriate HTTP status code, send a verbose, human-readable error message when an error has occurred. This can make all the difference to the developer writing client code to consume your resources. It is important that the developer be able to figure out whether they have made a mistake, or whether (and exactly how) the service is broken.
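A minimal sketch of pairing a status code with a readable message might look like this. The error_response helper, the JSON field names, and the width parameter check are hypothetical, chosen only to show the idea.

```python
import json


def error_response(status, message, detail=None):
    """Return (status, headers, body) for a JSON error response."""
    payload = {"status": status, "error": message}
    if detail is not None:
        payload["detail"] = detail
    return status, {"Content-Type": "application/json"}, json.dumps(payload)


def get_image(params):
    """Validate a hypothetical `width` parameter and explain what went wrong."""
    raw = params.get("width")
    if raw is None:
        return error_response(400, "Missing required parameter 'width'.")
    try:
        width = int(raw)
    except ValueError:
        width = 0
    if width <= 0:
        return error_response(
            400,
            "Parameter 'width' must be a positive integer.",
            detail=f"Received {raw!r}.",
        )
    return 200, {"Content-Type": "application/json"}, json.dumps({"width": width})
```

The 400 status tells the client the mistake is on its side, and the message says which parameter to fix and what was received.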
Consumers should anticipate failure
If a particular resource is unavailable, often the client may still be able to recover and serve a useful (if degraded) response to the end user. For example, we show the number of approved video clips on the footage site home page. If that number is not available from the media service, we’d rather show the home page without the number than send a 500 response. So in this case we can degrade to a message which just doesn’t reference that number.
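The home page example can be sketched as follows. The home_page_message function, the message wording, and the fetch_clip_count callable (standing in for the call to the media service) are assumptions for illustration.

```python
def home_page_message(fetch_clip_count):
    """Build the home page blurb, degrading if the count is unavailable."""
    try:
        count = fetch_clip_count()
    except Exception:
        # The media service is unavailable: fall back instead of sending a 500.
        # A real client would likely catch narrower, transport-specific errors.
        return "Browse our library of video clips."
    return f"Browse our library of {count:,} video clips."
```

The end user still gets a working page either way. Only the detail that depended on the failed call is left out.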