Show character with Blameless Postmortems (part one)
Posted Mar 16, 2022 | 7 min. (1310 words)
This is Part 1 of a two-part series on Blameless Postmortems. Today, we’ll discuss why blameless postmortems are so important and their implications for your team; the second part will go into detail on how to set them up as a process and make them successful.
Somebody wise may have once told you that how we handle adversity shows our character. Being able to acknowledge and admit mistakes is the first step towards learning — it’s a key part of success both in personal relationships and in large companies.
The way we handle mistakes as a culture is often the ceiling of the organization, and how we react to failure is a litmus test for dysfunction. By now most of us are familiar with the “pathological organization” in the famous Westrum study, where messengers are shot and failure leads to scapegoating. In these types of organizations, when something goes bump in the night, team leads immediately feel an icy wave of fear: OK, who screwed up this time? How can I blame this on someone else?
It’s immensely satisfying to buy into the “bad apple theory” and make an example of someone. This is the currency of the realm when it comes to pathological organizations, where points are scored by pinning blame on another fiefdom and avoiding accountability. Often this leads to “zero defect” meetings where a few scapegoat engineers are dragged before an inquisition, shamed and blamed for their role in an outage.
Perhaps worse than the storm and the fury of these blame sessions is silence. After one very embarrassing lapse at a large insurance company that led to our customers’ personal data being exposed, what I remember most of all is that, after a week or so of closed-door meetings, absolutely nothing was done. There was no open discussion of what happened or of what would be done to prevent a recurrence. Small wonder that Patrick Lencioni lists both “inattention to results” and “avoidance of accountability” among the most important dysfunctions he sees in organizations.
If we as people find it hard to admit mistakes, perhaps it shouldn’t shock us that this tendency to avoid or shift blame is so common at larger companies. Yet it is surprising that so many companies are touting their advances in release management and CI/CD - while the single biggest opportunity for improvement, one that costs almost nothing and can be implemented on a Friday afternoon, goes unspoken and ignored. At a recent conference, we asked almost 100 people if their company did blameless postmortems. Only two hands went up!
Enter the Blameless Postmortem
Good organizations dig deeper in how they handle outages and mistakes. They hold a postmortem - an open discussion of what happened, what and who was impacted, and then some simple steps that could help prevent the issue next time. But something’s odd about these postmortem sessions - we don’t see the poor engineer or operator being dragged before a tribunal to confess their sins. Instead, they’re openly sharing both what they knew at the time and what would have helped them make better decisions or made the system safer to interact with. What’s going on here? Why aren’t people being held accountable for screwing up?
Going back to that classic gauntlet of blame and punishment, you might think that this kind of scrutiny would lead to better, smarter behavior. In fact, it creates a vicious cycle where engineers become defensive and reluctant to describe openly what happened, which obscures the details that could lead to a quicker diagnosis or a better process next time. It also has the worst possible impact on the development process: who wants to risk their job over every minor change when the threat of punishment and public embarrassment is so drastic?
Interestingly, Google made blameless postmortems a fundamental part of their company’s SRE movement. As the Site Reliability Engineering book states:
“A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger-pointing and shaming individuals or teams for doing the “wrong” thing prevails, people will not bring issues to light for fear of punishment.
… when postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place. You can’t ‘fix’ people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems. When an outage does occur, a postmortem is not written as a formality to be forgotten. Instead, the postmortem is seen by engineers as an opportunity to not only fix a weakness, but to make Google more resilient as a whole.”
Far from preventing accountability, blameless postmortems may be the best way of maximizing a team’s effectiveness. In a recent 5-year study, Google found that high-performing teams had five common characteristics, the most important of which was psychological safety: a judgment-free environment where mistakes are learned from instead of punished.
John Allspaw describes why moving beyond shame and blame helped Etsy grow and learn as a company:
“…We want the engineer who has made an error to give details about why (either explicitly or implicitly) he or she did what they did; why the action made sense to them at the time. This is paramount to understanding the pathology of the failure. The action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place.”
This shifts the emphasis from the person to the circumstances, which are something that we can learn from and adjust. It’s also operating under the assumption that your people are doing the best they can with the information available to them.
Allspaw continues:
“Blameless culture originated in the healthcare and avionics industries where mistakes can be fatal. These industries nurture an environment where every “mistake” is seen as an opportunity to strengthen the system. When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place. You can’t “fix” people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems.
…If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. It’s also aligned with the idea there has to be some fear that not doing one’s job correctly could lead to punishment. Because the fear of punishment will motivate people to act correctly in the future. Right?”
A few years ago, when I was working for a company that was blame-focused and highly political, scheduling a blameless postmortem felt like career suicide. In a culture where incidents were either swept under the rug or used for political points, how would a meeting reviewing failure go over? I walked into the meeting with a feeling of intense dread and foreboding. Two hours later, I walked out beaming: we had a good action plan, and the stakeholders told us how thrilled they were to be involved in understanding the event and proposing some helpful fixes. We had made a huge leap away from political infighting and towards improving our quality of life.
In every other instance since then, postmortems have been remarkably effective in breaking down walls and improving collaboration. They work in almost every type of company, and they create more positive change than anything else I’ve tried. When it comes to building DevOps culture, a blameless postmortem process should be considered “first base”.