Tips for Effective Postmortems
Writing detailed and accurate postmortems allows you to learn quickly from mistakes and improve systems and processes for everyone. This guide lists some of the things we do to make sure our postmortems are effective.
- Make sure the timeline is an accurate representation of events.
- Define any technical jargon or acronyms you use that newcomers may not understand.
- Separate what happened from how to fix it.
- Write follow-up tasks that are actionable, specific, and bounded in scope.
- Discuss how the incident fits into our understanding of the health and resiliency of the services affected.
- Don't use the word "outage" unless it really was an outage. Accurately reflect the impact of an incident. "Outage" is usually too broad a term, and it can lead customers to think the product was fully unavailable when that was likely nowhere near the case.
- Don’t change details or events to make things "look better." Be honest in postmortems; otherwise, they lose their effectiveness.
- Don’t name and shame anyone. Keep postmortems blameless. If someone deployed a change that broke things, it's not their fault. Everyone is collectively responsible for building a system that allowed them to deploy a breaking change.
- Avoid the concept of "human error." Very rarely is the mistake "rooted" in a human performing an action. There are often several contributing factors (the script the human ran didn't have rate limiting, the documentation was out of date, etc.) that can and should be addressed.
- Don’t just point out what went wrong. Drill down to the underlying causes of the issue.
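
Two of the tips above, an accurate timeline and follow-up tasks that are specific and bounded, lend themselves to a quick automated check before review. The following is only a minimal sketch of what such a check might look like, not a description of any existing tool. The names `TimelineEntry`, `FollowUp`, `out_of_order`, and `vague_follow_ups`, and the fields `owner` and `done_when`, are hypothetical and chosen purely for illustration. The sample data at the bottom is made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEntry:
    """One event in the postmortem timeline (hypothetical structure)."""
    at: datetime
    note: str


@dataclass
class FollowUp:
    """One follow-up task (hypothetical structure)."""
    title: str
    owner: str      # who is responsible for the task
    done_when: str  # the condition that bounds the task's scope


def out_of_order(timeline):
    """Return adjacent pairs of entries whose timestamps go backwards."""
    return [(a, b) for a, b in zip(timeline, timeline[1:]) if b.at < a.at]


def vague_follow_ups(tasks):
    """Return tasks that lack an owner or a completion condition."""
    return [t for t in tasks if not t.owner.strip() or not t.done_when.strip()]


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    timeline = [
        TimelineEntry(datetime(2024, 1, 1, 10, 5), "Alert fired"),
        TimelineEntry(datetime(2024, 1, 1, 10, 2), "Deploy started"),
    ]
    tasks = [
        FollowUp("Add rate limiting to the migration script", "alice",
                 "Script refuses to exceed a configured request rate"),
        FollowUp("Improve documentation", "", ""),
    ]
    for earlier, later in out_of_order(timeline):
        print(f"Timeline goes backwards: {earlier.note!r} -> {later.note!r}")
    for task in vague_follow_ups(tasks):
        print(f"Follow-up needs an owner and a done condition: {task.title!r}")
```

A check like this only catches mechanical problems. It cannot tell whether the timeline is honest or whether a task addresses an underlying cause, so it complements review by people rather than replacing it.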