Tips for Effective Postmortems

Writing detailed and accurate postmortems allows you to learn quickly from mistakes and improve systems and processes for everyone. This guide lists some of the things we do to make sure our postmortems are effective.

Do's

Don'ts

  • Don't use the word "outage" unless the product really was fully unavailable. Describe the impact of an incident accurately. "Outage" is usually too broad a term, and it can lead customers to think the product was fully unavailable when that was likely nowhere near the case.
  • Don’t change details or events to make things "look better." Be honest in postmortems; otherwise they lose their effectiveness.
  • Don’t name and shame someone. Keep postmortems blameless. If someone deployed a change that broke things, it's not their fault. Everyone is collectively responsible for building a system that allowed them to deploy a breaking change.
  • Avoid the concept of "human error." Very rarely is the mistake "rooted" in a human performing an action. There are often several contributing factors (the script the human ran didn't have rate limiting, the documentation was out of date, etc.) that can and should be addressed.
  • Don’t just point out what went wrong. Drill down to the underlying causes of the issue.