Blameless Postmortem Guide for Engineering Teams

In a blame culture, people hide information during postmortems. They minimise their involvement, shift responsibility, and the real causes go undiscovered. The same incidents keep happening. Blameless postmortems break that cycle by separating what happened from who did it - creating enough safety for your team to be genuinely honest about the systemic gaps that made the incident possible. This guide shows you how to establish and maintain that safety while still driving accountability for improvement.

Why Blamelessness Is Essential

The case for blameless postmortems is both moral and practical. Morally, blaming an individual for a systemic failure is unfair and inaccurate. In complex systems, incidents are almost never caused by a single person's mistake; they result from the interaction of multiple factors, including system design, process gaps, tooling deficiencies, and organisational pressures.

Practically, blame kills learning. When people fear punishment, they hide information, minimise their involvement, and avoid volunteering for incident response. In a blame culture, the postmortem becomes a political exercise where everyone tries to shift responsibility rather than a genuine investigation into what went wrong. The result is that the real causes go unaddressed, and the same types of incidents recur.

Blamelessness does not mean that nobody is accountable. It means that the postmortem separates the question of what happened and why from the question of individual performance. The postmortem addresses the systemic factors; any individual performance issues are handled separately through private management conversations. This distinction is crucial and must be communicated clearly to the team.

Complex system failures are systemic, not individual - blame is inaccurate and counterproductive
Blame suppresses information sharing and honest analysis of incidents
Blamelessness is not the same as lack of accountability - individual issues are handled separately
Teams that practise blameless postmortems report fewer repeat incidents over time
The engineering manager's behaviour during incidents sets the tone for the entire team's culture

The Postmortem Process Step by Step

Hold the postmortem within forty-eight hours of the incident while memories are fresh. Invite everyone who was involved in the incident - those who detected it, responded to it, and were affected by it. Also invite interested observers who can learn from the discussion. The meeting should last sixty to ninety minutes for significant incidents.

Structure the meeting in four phases: timeline reconstruction, contributing factor analysis, action item identification, and reflection. Start by building a detailed timeline of events - what happened, when, and what information was available at each decision point. This factual foundation prevents the discussion from devolving into competing narratives.
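One way to keep the timeline factual is to capture each entry as structured data before the meeting. The Python sketch below is illustrative only: the `TimelineEntry` fields, the `build_timeline` helper, and the sample events are assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEntry:
    """One event in the incident timeline (hypothetical structure)."""
    at: datetime      # when it happened
    event: str        # what happened
    known_then: str   # what information was available at that point


def build_timeline(entries):
    """Return entries in chronological order so the group reads one shared sequence."""
    return sorted(entries, key=lambda e: e.at)


# Invented sample data, entered out of order as people recall events.
entries = [
    TimelineEntry(datetime(2024, 1, 10, 14, 20), "On-call paged", "Error rate alert only"),
    TimelineEntry(datetime(2024, 1, 10, 14, 5), "Deployment started", "All checks green"),
    TimelineEntry(datetime(2024, 1, 10, 14, 45), "Rollback completed", "Errors traced to new release"),
]

timeline = build_timeline(entries)
for e in timeline:
    print(f"{e.at:%H:%M}  {e.event}  (known at the time: {e.known_then})")
```

Recording what was known at each point alongside the event itself gives the facilitator something concrete to refer to when the discussion drifts toward what people "should have" seen.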

During the contributing factor analysis, ask why at each critical juncture. Not just 'Why did the deployment fail?' but 'Why did our testing not catch this? Why did the deployment process allow this change to reach production? Why did our monitoring not alert us sooner?' Each why leads to a deeper systemic factor. The goal is to identify the contributing factors that, if addressed, would prevent this type of incident from recurring.

Facilitating Effective Postmortem Discussions

The facilitator sets the tone for the postmortem. Begin by explicitly stating the blameless principle: 'We are here to understand what happened and how to prevent it, not to assign blame. Everyone involved made the best decisions they could with the information available at the time.' This framing should be repeated whenever the discussion veers toward individual criticism.

Ask open-ended questions that encourage systemic thinking: 'What information would have helped you make a different decision?' rather than 'Why did you not check the logs?' The first question explores the system; the second implies individual failure. When someone describes an action they took, ask about the context: 'What were you seeing at that point? What options did you consider? What constraints were you working under?'

Watch for hindsight bias - the tendency to view past decisions as obviously wrong because we now know the outcome. During the incident, the responders had incomplete information and were under time pressure. The facilitator should regularly remind the group of what was and was not known at each point in the timeline, preventing unfair judgement of decisions made under uncertainty.

Turning Postmortem Insights into Lasting Change

Every postmortem should produce three to five specific, actionable items with clear owners and due dates. Avoid vague actions like 'improve monitoring' - instead, specify 'add latency alerting to the payment service with a threshold of 500ms, owned by Sarah, due in two weeks.' Specific actions are more likely to be completed and more easily verified.
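The difference between a vague and a specific action can be made checkable. The sketch below is a minimal illustration, assuming a hypothetical `ActionItem` record and a deliberately crude word-count heuristic for vagueness; the five-word cut-off and the dates are invented for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ActionItem:
    """A postmortem action; the due date is a required field by construction."""
    description: str
    owner: str
    due: date


def problems(item, min_words=5):
    """Return reasons an action item is not yet specific enough.

    The word-count check is an assumed heuristic, not a rule from the guide.
    """
    issues = []
    if len(item.description.split()) < min_words:
        issues.append("description too vague")
    if not item.owner.strip():
        issues.append("no owner")
    return issues


raised = date(2024, 1, 10)  # invented date of the postmortem

vague = ActionItem("improve monitoring", "", raised + timedelta(weeks=2))
specific = ActionItem(
    "Add latency alerting to the payment service with a threshold of 500ms",
    "Sarah",
    raised + timedelta(weeks=2),
)

print(problems(vague))
print(problems(specific))
```

A check like this will not judge quality, but it can stop ownerless or two-word actions from leaving the meeting unchallenged.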

Categorise actions by type: immediate fixes (changes needed before the next deployment), short-term improvements (changes that can be completed within one or two sprints), and systemic investments (changes that require broader organisational support). The engineering manager's responsibility is to ensure short-term items enter the sprint and systemic items are escalated with appropriate urgency.
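The three categories above map naturally onto a small routing table. The following sketch assumes hypothetical names (`ActionType`, `NEXT_STEP`, `route`) and invented sample actions; it only shows how each category could be tied to its next step.

```python
from enum import Enum


class ActionType(Enum):
    IMMEDIATE = "immediate fix"
    SHORT_TERM = "short-term improvement"
    SYSTEMIC = "systemic investment"


# Next step per category, paraphrasing the guidance in the text.
NEXT_STEP = {
    ActionType.IMMEDIATE: "complete before the next deployment",
    ActionType.SHORT_TERM: "add to the sprint backlog",
    ActionType.SYSTEMIC: "escalate for broader organisational support",
}


def route(actions):
    """Group (description, ActionType) pairs by the next step they require."""
    routed = {step: [] for step in NEXT_STEP.values()}
    for description, action_type in actions:
        routed[NEXT_STEP[action_type]].append(description)
    return routed


# Invented sample actions.
actions = [
    ("Revert the faulty configuration flag", ActionType.IMMEDIATE),
    ("Add latency alerting to the payment service", ActionType.SHORT_TERM),
    ("Fund a staging environment that mirrors production", ActionType.SYSTEMIC),
    ("Add a smoke test to the deployment pipeline", ActionType.SHORT_TERM),
]

routed = route(actions)
for step, items in routed.items():
    print(f"{step}: {items}")
```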

Track postmortem action completion as a team metric. If actions from six months ago remain incomplete, the postmortem process is generating insights but not driving change. Review outstanding postmortem actions in team meetings and retrospectives. When actions languish, investigate why - it often reveals capacity constraints, competing priorities, or actions that were too large to be practical.
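As a rough illustration of the metric, the sketch below computes a completion rate and lists open actions older than about six months. The dictionary keys, the 180-day cut-off, and the sample data are assumptions for the example; in practice the records would come from whatever tracker the team already uses.

```python
from datetime import date


def completion_rate(actions):
    """Fraction of actions marked done; 0.0 when there are no actions."""
    if not actions:
        return 0.0
    return sum(1 for a in actions if a["done"]) / len(actions)


def stale(actions, today, max_age_days=180):
    """Open actions created more than max_age_days ago (roughly six months)."""
    return [
        a for a in actions
        if not a["done"] and (today - a["created"]).days > max_age_days
    ]


# Invented sample data.
actions = [
    {"title": "Add latency alerting", "created": date(2024, 5, 1), "done": True},
    {"title": "Automate rollback", "created": date(2023, 11, 1), "done": False},
    {"title": "Document failover runbook", "created": date(2024, 5, 1), "done": False},
    {"title": "Add deploy smoke test", "created": date(2023, 12, 1), "done": True},
]

today = date(2024, 7, 1)
rate = completion_rate(actions)
overdue = stale(actions, today)

print(f"Completion rate: {rate:.0%}")
print("Stale actions:", [a["title"] for a in overdue])
```

The stale list is the useful part for a retrospective: each entry is a prompt to ask whether the action should be closed, broken down, or escalated.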

Building a Learning Culture Around Incidents

Publish postmortem reports widely. When postmortems are visible across the organisation, they become a powerful learning resource. Other teams can learn from your incidents without experiencing them firsthand. A searchable archive of postmortems also helps new team members understand the system's failure modes and the reasoning behind certain architectural decisions.

Celebrate good incident response and thorough postmortems. When a team detects an incident quickly, responds effectively, and produces a postmortem that leads to systemic improvements, recognise that publicly. This reinforces the message that incidents are learning opportunities and that the postmortem process is valued.

Conduct periodic reviews of postmortem themes. Every quarter, review the last twelve postmortems and look for patterns. Are most incidents caused by deployment failures, configuration errors, dependency issues, or capacity problems? Patterns that span multiple incidents indicate systemic issues that individual postmortem actions may not address. These themes should inform broader engineering investment decisions.
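A quarterly theme review can start with a simple tally. The sketch below assumes each postmortem has been tagged with one primary category; the `recurring_themes` helper, the `min_count` threshold, and the sample tags are hypothetical.

```python
from collections import Counter


def recurring_themes(postmortems, min_count=2):
    """Return (category, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(pm["category"] for pm in postmortems)
    return [(category, n) for category, n in counts.most_common() if n >= min_count]


# Invented sample: one primary category per postmortem.
postmortems = [
    {"id": 1, "category": "deployment"},
    {"id": 2, "category": "configuration"},
    {"id": 3, "category": "deployment"},
    {"id": 4, "category": "dependency"},
    {"id": 5, "category": "configuration"},
    {"id": 6, "category": "deployment"},
    {"id": 7, "category": "capacity"},
]

themes = recurring_themes(postmortems)
for category, n in themes:
    print(f"{category}: {n} incidents")
```

A single tag per incident is a simplification, since most incidents have several contributing factors; allowing multiple tags per postmortem would give a fuller picture at the cost of a slightly more involved count.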

Key Takeaways

Hold postmortems within forty-eight hours while memories are fresh and details are accurate
Separate systemic analysis (the postmortem) from individual performance discussions (private management)
Ask open-ended, context-seeking questions rather than accusatory ones
Produce three to five specific actions with owners and due dates - track completion as a team metric
Publish postmortem reports widely and review themes quarterly to identify systemic patterns

Frequently Asked Questions

What incidents should trigger a postmortem?

At minimum, any incident that affected customers or caused significant internal disruption should trigger a postmortem. Many teams also run postmortems for near-misses - incidents that were caught before causing damage but could have been serious. Some organisations run postmortems for all incidents above a certain severity level. The threshold should be low enough to capture important learning opportunities but high enough to avoid postmortem fatigue.

How do you keep postmortems blameless when someone clearly made a mistake?

Reframe the question from 'Who made a mistake?' to 'What about our system allowed this mistake to have this impact?' If an engineer deployed untested code, the postmortem should ask: Why was it possible to deploy untested code? Where were the automated checks? What about the deployment process encouraged skipping tests? The individual's action is one link in a chain of systemic factors. Address the systemic factors and you prevent the entire class of errors, not just the specific one.

How do you handle postmortem action items that never get completed?

Incomplete actions usually indicate one of three things: they were not important enough (in which case, close them), they were too large to be practical (in which case, break them down), or they were deprioritised in favour of feature work (in which case, escalate the prioritisation decision). Track the action completion rate as a metric and raise it in team retrospectives and management reviews. If the organisation consistently deprioritises postmortem actions, escalate the risk this creates to senior leadership with data on incident recurrence.
