I’ve messed up many times, as has everyone else on my team. There have been outages, oversaturated content, and frustrated users. We’ve put out a lot of fires. And we’ve done a lot of postmortems afterward.
“What’s a postmortem?”
I’m delighted you asked! Postmortems, also referred to, less deathly, as “root cause analysis,” are a way of retrospecting on a particular incident or failure. They help you better understand what went wrong and – crucially – how it can be avoided in the future. We’ve found that a few guidelines are necessary for getting maximum value from postmortems:
- Postmortems are blameless. Nobody is singled out for fault.
- Keep the focus on discovering the root causes of the problem, which, when you go deep enough, are almost always process failures, e.g. we didn’t tell the other team about the upcoming deployment, we didn’t have the right monitoring and alerts, etc.
- Generate concrete actions you can take to prevent the problem from happening again, e.g. add communication with other teams to the deployment checklist, add X monitoring and Y alerts, and so on.
I’m not going to talk about why it’s so important to do postmortems or even specifically how to do them because there are already a lot of great posts out there on the topic. Like Jeff Atwood says, “I don’t think it matters how you conduct the postmortem, as long as you do it.”
As those links suggest, postmortems can also be applied to software development projects in general; there doesn’t need to be a smoldering crater for you to sit down and examine what could be done differently. But I’d like to talk about postmortems specifically in the context of production incidents. That’s when things get crazy stressful, and having a good and reliable process can really help you get the most out of the experience.
#1 Take notes as you put out the fire.
When it comes to the investigation and retrospectives following a big emergency, I can’t stress enough how important it is to have a timeline of what happened and when. Your timeline forms the backbone of all the other facts and conclusions you gather about what happened and why. As soon as the incident ends, your recollection of everything that happened, in what order, and why you took the actions you did may already be getting fuzzy, let alone by the next morning’s stand-up.
“Pshh, I can get metrics on when the failure happened from my logs and graphs later.”
You’re totally right. And that’s exactly why metrics have no place in the timeline: it should capture what you thought was happening and the actions you took at a given moment. Here’s an example of how it might play out in real time:
Alex – got an alert for high rep lag, investigating
Sidharth – can’t connect to the site from outside the VPN
Alex – database is hosed, doing a failover
Alex – failover complete
Sidharth – site looks back up
Wait, this looks eerily familiar. Almost like it’s happening in Slack, or HipChat, or whatever your team decided had the best emojis. And why not? In an emergency, you’ll also want to tell the rest of your team and the company at large what’s going on in real time. Any team collaboration tool worth its salt will keep these messages archived with timestamps, so as long as you type what’s happening into a text box, you’ll get both great communication in the moment and a record of it in your timeline. This historical record will soon be your best friend as you try to reconstruct and learn from these unfolding events.
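As a sketch of how a chat log becomes a timeline, the snippet below turns a hypothetical chat export – JSON lines with `ts`, `user`, and `text` fields, loosely modeled on the shape of Slack exports, but not any particular tool’s real format – into timestamped timeline entries:

```python
import json
from datetime import datetime, timezone

def build_timeline(export_lines):
    """Turn chat-export JSON lines into 'HH:MM user - message' entries."""
    entries = []
    for line in export_lines:
        msg = json.loads(line)
        # Assumed format: "ts" is an epoch timestamp stored as a string
        ts = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc)
        entries.append(f"{ts:%H:%M} {msg['user']} - {msg['text']}")
    return entries

export = [
    '{"ts": "1500000000.0", "user": "Alex", "text": "got an alert for high rep lag, investigating"}',
    '{"ts": "1500000240.0", "user": "Sidharth", "text": "site looks back up"}',
]
timeline = build_timeline(export)
```

Even a throwaway script like this saves you from squinting at screenshots of chat history the next morning.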
#2 Who has point? Who has comms? Define your roles.
Imagine your second-on-call jumps online with you to help fight the fire. Soon you’re both in the zone and, uh oh, nobody’s been posting status updates for the last 15 minutes. Or, whoops, your buddy already ran that restart command – wasn’t it obvious you were planning to?
In the heat of a production outage, it’s easy to lose track of who should be working on what. Your instinct is to get the problem fixed ASAP, whatever it takes. But these aren’t times for heroes to rise up and try to save everything alone; that only leads to confusion and duplicated work. And because people aren’t always in the same physical location when these kinds of crises arise, it’s even more challenging to delegate and report back.
What helps is to have clearly defined roles: the first engineer, let’s call her “point,” runs all the agreed-upon commands; the second, let’s call him “comms,” messages out the important updates. Agree on this process as a team before an incident happens. Then, the instant you start responding to an incident, decide who’s serving which role.
“I don’t have time for this process – I have to get the site back up!”
Okay, take a deep breath. There’s a reason EMTs don’t run. When you panic, you make shortsighted decisions. By sticking to your assigned roles, you’ll not only avoid confusion but also give people (both those working on the solution, and those waiting for resolution) a calming sense of purpose.
#3 Do follow-up analysis before the data – and your passion – evaporates.
“Phew, the site is back up. Gonna get some sleep, I’ll fill in the gaps in the timeline tomorrow.”
Nope, you won’t remember. Or your coworker will dispute your account. Do it right away while it’s fresh.
But what should you do the next day?
Hopefully, your service’s request logs are getting shipped into HDFS or some other kind of log afterlife, where you can dig them up to do analysis or replay them weeks after the fact. But storage is finite and not all logs go to heaven. System logs on an edge node might get truncated after a week or even a day – but maybe they hold the key to your understanding of what went wrong.
The morning after an incident, make sure someone owns the task of taking a “snapshot” of all the logs that might be relevant. Not sure which logs? Document all the logs that might matter ahead of time.
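As one sketch of what that snapshot step could look like, here’s a small script that bundles a predefined list of log files into a timestamped tarball. The paths and names are placeholders, not anything from a real system:

```python
import tarfile
from datetime import datetime
from pathlib import Path

def snapshot_logs(log_paths, archive_dir):
    """Bundle the given log files into a timestamped tarball so they
    survive rotation and truncation on the edge node."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    archive = archive_dir / datetime.now().strftime("incident-%Y%m%d-%H%M%S.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for path in map(Path, log_paths):
            if path.exists():  # a log may already have been rotated away
                tar.add(path, arcname=path.name)
    return archive
```

Keeping the candidate log list in version control alongside this script doubles as the “document all the logs that might matter ahead of time” step.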
The same can apply to graphs of time series data. Systems like Graphite will naturally lose granularity the further back in time you look. If your postmortem links to these graphs directly, a curious observer might click through a month later only to find unhelpful polygons where your beautifully rendered latency spikes used to be. To avoid this, take screenshots of the important graphs and share them along with the links.
Another reason for gathering up all your data immediately is pretty simple. You want to learn as much as you can about the incident while it’s still fresh in people’s minds, and while passion about the problem is high. Don’t shrug this off as a backlog item to follow up on, because new problems and priorities are bound to arrive. But at the same time…
#4 Know when to give up.
A teammate and I once got alerted that our service cluster was spiking in latency and error rate. We investigated and realized it was simply not going to recover while serving traffic. We triggered a failover to a secondary cluster so that the first could be cleanly restarted. It looked like things had stabilized, but then latency and errors spiked again – the secondary cluster was falling over too. As soon as the primary cluster was back up, we disabled the failover, but then the primary again experienced a latency spike. Starting to sweat, we half-re-enabled the failover, sending partial traffic to each cluster, until we were finally able to resolve the problem.
Confused? So was I. The next day I sat down to finalize the timeline of this incident. The problem was that the stretch around our failover square dance was utterly perplexing. We had stopped posting updates at that point, and for all my digging into access logs and system logs I couldn’t definitively figure out the order in which certain things happened. At some point, I realized I had spent an hour trying to reconstruct what had been about 30 minutes in real time. I craved an explanation and some strong lessons we could learn, but sometimes the universe just doesn’t allow for them.
I asked myself then, what was the value of knowing exactly when we were failed over, un-failed over, half failed over? It didn’t actually matter to the postmortem or its lessons, so I wrote a summary saying that things got confusing, and we moved on.
#5 Share your postmortem in an accessible, standardized way (and have a tool to do this)
Whatever postmortem process you opt to use – maybe it’s a big meeting with execs and other teams, or maybe it’s just you and your thoughts – make sure to write up the facts and conclusions you come to. By writing everything up in an organized and timely fashion, you’ll have a historical account squirreled away somewhere, so that down the line, when your team needs to recall those lessons and avoid repeating the same mistakes, you won’t have to rely on oral tradition.
It’s also helpful to share this write-up with the larger organization. People outside your orbit will benefit from your experience, and sometimes you’ll even get suggestions or lessons learned you didn’t think of.
As with most publications, it helps to have a standard format. Everyone will bring a different style to postmortem write-ups, but they should all follow the same flow and include your agreed-upon required fields. Some suggested fields:
- A summary of what happened and the impact
- The timeline of the incident
- Graphs showing the issue
- Root causes (at Shutterstock we do the 5 Whys)
- Lessons and action items
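If your write-ups live in any structured form, checking for those required fields is easy to automate. A minimal sketch – the field names mirror the list above, but the plain-dict representation is an assumption for illustration, not how any real tool stores postmortems:

```python
REQUIRED_FIELDS = ("summary", "timeline", "graphs", "root_causes", "lessons_and_actions")

def missing_fields(postmortem):
    """Return any required fields that are absent or empty in a write-up,
    here represented as a plain dict."""
    return [field for field in REQUIRED_FIELDS if not postmortem.get(field)]

draft = {
    "summary": "Primary DB failover cascaded to the secondary cluster.",
    "timeline": ["02:40 Alex - got an alert for high rep lag, investigating"],
}
```

A check like this can gate publishing, nudging authors to finish the root-cause and action-item sections before circulating the write-up.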
Your format can evolve over time as your company’s needs change, so be willing to add more stakeholders and fields as more attention is paid to your postmortems.
“Ugh, too much bureaucracy…”
You’re right. No one is going to “fill out a report” unless they’re forced to. That’s why your process needs to be something people will buy into. As developers at Shutterstock, we knew that the way to get other devs genuinely excited about the prospect was to design a tool that makes things simpler and better.
Splinter, a tool we developed in-house during a quarterly hackathon, provides us with a standard form to fill out for every new postmortem, and then an easy way to share it once it’s ready for circulation. It’s intended to make postmortems hassle-free so there are no excuses for not doing one. There are a number of third-party tools out there that fit this use case too if you’re so inclined. It’s your call.