Suppose you’re a startup looking to grow into a mid-sized tech company — somewhere between 30 and 100 engineers. Hiring is happening fast, and the amount of code you’re churning out is growing just as quickly. At an earlier stage, your company was focused on proving the product; everything was smaller in scale and you were able to iterate fast. Now, as you grow, you have many more developers writing code and many more variables in the mix.
This is when you notice the quality of the product begin to deteriorate, and you can’t release code as fast as you’d hoped. In scaling the business, there were a growing number of variables to juggle, and you might have glossed over the need to invest more in testing.
If you decide to hire a QA manager, who in turn brings aboard a team of automation engineers, you can get out of the woods for a while, with Selenium tests covering a very high percentage of the product. But then, over time, things begin to slow down again. All that automation, and the goodwill you built with Selenium coverage, starts to break and fail, repeatedly halting the software factory.
Where we started
When I joined Shutterstock, I was impressed by the amount of automated test coverage the company had. Almost every piece of functionality on the site had test coverage in the form of Selenium end-to-end tests. Shutterstock had a development workflow in place through Jenkins that would block deploys to production if the Selenium tests failed. I liked that; it meant no one could release anything into production unless all the tests passed.
But soon after, I realized that our company, which had been releasing multiple times a day, had turned into a company that was blocked from releasing for multiple days at a time, mainly because of failing Selenium tests. Most often, the tests failed not because of a broken product, but because they were fragile.
A few things led up to this:
- End-to-end Selenium-based acceptance tests became the only form of automated testing everyone depended on. Many teams stopped writing unit tests altogether.
- The test framework, which was flaky, was built and owned solely by the QA team. When something failed and the entire software factory came to a halt, the burden of figuring out what went wrong fell on a small group of three to five people in the QA team, and they would often be blamed for slowing down the rest of the organization.
- The engineering organization spent a lot of time figuring out how to build a product that could scale, but not enough attention went into building a development workflow that would support such product development.
- Quality was owned solely by the QA team.
At our core, we had a QA organization that had not scaled with the rest of the organization. While they had the skills to automate everything, they lacked the core skills necessary to build a scalable test framework. Because of this gap, they were unable to influence the rest of the organization to think of quality as something that was owned by all, rather than just the QA team. To close this gap, we had to rethink our approach to QA as a whole.
Toward a new beginning
I wanted to accomplish two goals: First, to rebuild Shutterstock’s test infrastructure/frameworks to be more stable, and second, to change the engineering culture at Shutterstock to be one where quality was not just owned by the Test Engineering team, but rather by everyone.
We changed the core competencies we looked for when hiring Test engineers. We wanted our Test engineers to be strong developers who knew how to build object-oriented solutions that would help them create a stable and scalable test framework. We also wanted them to be influencers who could push their team to do the right thing and not take shortcuts such as skipping unit tests. Once we had built out a world-class Test engineering team, we began figuring out how to release fast while maintaining a high quality product.
We knew our largest problem was fragile tests, so we built a tool called Sagacity to record each test’s pass/fail data. All our tests pushed data into Sagacity each time they ran as a part of our Jenkins workflow. We then built a website on top of this database to make the data easy to mine. We were now able to monitor pass rates for jobs, pass rates for individual tests, the most commonly occurring failure messages, the longest-running tests, and more. Armed with this data, we could hold ourselves, and others, more accountable. One of the core teams most impacted by failing tests realized that their usual pass rate was just 20%. (Imagine how often the software factory came to a halt because of this roadblock.) Using Sagacity, they were able to quickly isolate the tests with the lowest pass rates and see the common failure messages behind them. The team made simple fixes to the test scripts to improve their reliability.
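The bookkeeping at the heart of a tool like Sagacity is simple: record every run, then let queries surface the flakiest tests and their most common failure messages. A minimal sketch of the idea (the schema, field names, and sample data here are hypothetical, not Sagacity's actual design):

```python
import sqlite3

# Hypothetical schema: one row per test execution, pushed from CI.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE test_runs (
        job TEXT, test TEXT, passed INTEGER, failure_msg TEXT, duration_s REAL
    )
""")
runs = [
    ("search-e2e", "test_search_filters", 0, "element not found: #filter", 42.0),
    ("search-e2e", "test_search_filters", 0, "element not found: #filter", 40.5),
    ("search-e2e", "test_search_filters", 1, None, 39.0),
    ("search-e2e", "test_pagination", 1, None, 12.0),
]
conn.executemany("INSERT INTO test_runs VALUES (?, ?, ?, ?, ?)", runs)

# Pass rate per test, flakiest first.
rows = conn.execute("""
    SELECT test, AVG(passed) AS pass_rate, COUNT(*) AS n
    FROM test_runs GROUP BY test ORDER BY pass_rate
""").fetchall()

# Most common failure message for a given flaky test.
top_failure = conn.execute("""
    SELECT failure_msg, COUNT(*) AS n FROM test_runs
    WHERE passed = 0 AND test = 'test_search_filters'
    GROUP BY failure_msg ORDER BY n DESC LIMIT 1
""").fetchone()
```

A dashboard on top of queries like these is what turned "the build is red again" into "this one test fails two-thirds of the time, always with the same message."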
The launch of Sagacity, coupled with the right set of test engineers championing our new testing culture across their teams, led to an almost immediate uptick in the weekly pass rates for our automation jobs — from 20% to 80% in some cases.
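Many of those simple fixes were variations on one theme: replacing fixed sleeps with polling for an explicit readiness condition. A minimal sketch of that pattern (a hypothetical helper, not our actual framework code):

```python
import time

def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Unlike a fixed `time.sleep(5)`, this returns as soon as the condition
    holds, and fails loudly (instead of silently racing) when it never does.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Fragile: assumes the page is always ready after exactly 5 seconds.
#   time.sleep(5); assert element.is_displayed()
#
# Sturdier: poll for the actual readiness signal.
#   wait_until(lambda: element.is_displayed())
```

Selenium ships this idea as `WebDriverWait` with expected conditions; the point is that a test should wait on a signal, not a stopwatch.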
Sagacity allowed for a lot of quick wins because it gave developers access to actionable data on their test suites. But we still had one lingering problem: we were too dependent on a large number of end-to-end Selenium-based UI tests as the only quality gate for releases. We had a larger problem to solve, and that was how to build quality checks further upstream into the development process.
At Shutterstock, for an end-to-end UI test job to run and return a result, a developer needs to make a code change, then merge it back to a branch, which kicks off a build and deploy followed by acceptance tests. Given the large number of tests, the automation run could take anywhere between 30 and 60 minutes before the developer got any sort of signal on whether they had inadvertently introduced a bug into the system. Multiply this by 100+ developers at Shutterstock and you realize we were spending a lot of time waiting for tests to run. And if the tests failed, the developer would have to go back and repeat the whole process.
We knew giving developers more immediate feedback was a key step in speeding up our software factory. To do that, we needed to build quality into each step of the development workflow rather than as a step at the end. To accomplish that, we did the following:
- We integrated SonarQube with our development process so that each build would push unit test coverage data onto SonarQube. SonarQube helped us get a dashboard of unit test coverage across all the repos at Shutterstock. We put this dashboard up on the monitors in different areas to make everyone aware of their team’s unit test coverage. We saw teams that had less than 10% coverage respond quickly to bring their coverage numbers up so that they could compete with other teams.
- We introduced the concept of mid-level tests, which ran against a sandboxed version of our service or application with all external dependencies mocked out. Using technologies such as Docker, we were able to run these tests without requiring a deploy. In addition, we could run the same Selenium-based UI tests against the sandboxed version of the app in a tenth of the time. As such, developers could kick off mid-level tests and get pass/fail results back almost immediately.
- We used Drone and Docker to build a number of quality checks directly into every pull request. As soon as a pull request was created, the developer and the reviewer got immediate feedback on code coverage numbers, results of mid-level tests, and results of unit tests. We armed our developers with data about the quality of their code before any of their code would be merged.
- We reintroduced people across the engineering organization to the test pyramid. As a part of each feature, we embraced a healthy discussion about how it should be tested, where the right set of tests should live, and anything else that came to mind.
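The mid-level tests described above boil down to exercising real application logic with its external dependencies stubbed out, so they need neither a deploy nor a browser. A minimal sketch, assuming Python and its standard `unittest.mock` (the service, endpoint, and function names here are hypothetical):

```python
from unittest import mock

# Hypothetical application code: a service that checks licensing
# status via an external HTTP API before allowing a download.
def fetch_license_status(client, image_id):
    resp = client.get(f"/licenses/{image_id}")
    return resp["status"]

def is_downloadable(client, image_id):
    # The logic under test: only licensed images may be downloaded.
    return fetch_license_status(client, image_id) == "licensed"

# Mid-level test: real logic, mocked dependency, millisecond runtime.
def test_is_downloadable():
    client = mock.Mock()
    client.get.return_value = {"status": "licensed"}
    assert is_downloadable(client, "img-123")
    client.get.assert_called_once_with("/licenses/img-123")

    client.get.return_value = {"status": "unlicensed"}
    assert not is_downloadable(client, "img-123")
```

Tests like this sit in the middle of the test pyramid: broader than a unit test, but without the deploy-and-wait cycle of an end-to-end Selenium run.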
No matter how your company structures its release cadence, you need to emphasize quality in every step of your development process. Companies spend large amounts of time working out how their software will be architected, but not enough time thinking about how quality will be built into the software; instead, they blindly automate everything. Your development workflow, and how you build quality into each step of it, is crucial in determining how fast you can iterate on your product.
At Shutterstock, the ability to release daily improvements to our product is a key differentiator between us and our competitors. We are always looking to improve our approach to software releases. By asking ourselves some difficult questions and assessing our behavior, we improved our workflow and built an engineering culture where everyone owns quality. Since making this change, we have done more than simply speed things up: We’ve seen firsthand how new ideas and innovations can radically change company culture for the better.
5 downloads/second, amazing tech, inspiring people. We’re hiring, apply now!