Production incidents are inevitable, but how your team responds to them defines your engineering culture. Effective incident management combines rapid technical response with clear communication, structured decision-making, and systematic learning from failures. This guide covers how to build an incident management capability that resolves issues quickly and prevents them from recurring.
Establishing an Incident Response Framework
An incident response framework provides structure during the chaos of a production outage. Define severity levels with clear criteria - what constitutes a SEV1 versus a SEV3? Define roles: incident commander (coordinates the response), technical lead (drives diagnosis and resolution), and communications lead (manages stakeholder updates). These roles should be clearly assigned within the first few minutes of an incident.
Document your incident process step by step: how incidents are declared, how severity is assessed, how responders are assembled, how communication channels are set up, and how the incident is resolved and closed. This documentation should be accessible during an incident - a runbook that no one can find when the systems are down is useless.
Practice your incident response regularly. Tabletop exercises, game day simulations, and chaos engineering experiments test your process under controlled conditions. These practices reveal gaps in runbooks, communication plans, and tooling before a real incident exposes them under pressure.
- Define clear severity levels with specific criteria for each level
- Assign incident roles - commander, technical lead, communications lead - within the first minutes
- Document the entire incident process in an accessible, well-known location
- Practice incident response through tabletop exercises and game day simulations
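The severity ladder and classification criteria above can be captured as code so that on-call engineers do not have to interpret prose under pressure. A minimal sketch follows - the level names, thresholds, and cadences are illustrative assumptions, not a standard; calibrate them against your own services and SLOs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityLevel:
    name: str
    criteria: str
    stakeholder_update_minutes: int  # cadence for stakeholder updates
    page_on_call: bool               # page the on-call engineer immediately?

# Illustrative ladder - names, thresholds, and cadences are assumptions.
SEVERITY_LEVELS = [
    SeverityLevel("SEV1", "Full outage or data loss for most users", 30, True),
    SeverityLevel("SEV2", "Major feature degraded; workaround exists", 60, True),
    SeverityLevel("SEV3", "Minor degradation or internal-only impact", 240, False),
]

def classify(user_impact_pct: float, data_loss: bool) -> SeverityLevel:
    """Map two crude signals to a severity level (deliberately simplified)."""
    if data_loss or user_impact_pct >= 50:
        return SEVERITY_LEVELS[0]  # SEV1
    if user_impact_pct >= 10:
        return SEVERITY_LEVELS[1]  # SEV2
    return SEVERITY_LEVELS[2]      # SEV3
```

Encoding the criteria this way also makes the downstream behaviour - who gets paged, how often stakeholders hear from you - a direct consequence of the severity assessment rather than a separate judgment call.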
Leading During Active Incidents
During an active incident, your primary role as the engineering manager is to ensure the response is coordinated, resources are available, and communication is flowing. Resist the temptation to take over the technical investigation - your engineers are better positioned for this. Instead, focus on enabling them to work effectively.
Establish a single communication channel for the incident response team and a separate channel for stakeholder updates. Mixing operational discussion with stakeholder communication creates confusion and slows both. Update stakeholders on a regular cadence - every 30 minutes for high-severity incidents - even if the update is 'still investigating.'
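The cadence rule above is easy to enforce mechanically - for example, as a check a chat-ops bot might run against the incident channel. This is a hypothetical sketch of that check, not a real bot API:

```python
from datetime import datetime, timedelta

def update_overdue(last_update: datetime, cadence_minutes: int, now: datetime) -> bool:
    """True when the next stakeholder update is due - even if the update
    is just 'still investigating', silence is worse than a no-news update."""
    return now >= last_update + timedelta(minutes=cadence_minutes)
```

A bot polling this every minute with the cadence from the incident's severity level would nudge the communications lead whenever it returns True.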
Make clear decisions about trade-offs during the incident. Sometimes the fastest resolution involves a risky change, a temporary workaround, or degraded functionality. The incident commander should have the authority to make these decisions without lengthy debates. After the incident, evaluate whether the trade-offs were appropriate - during the incident, speed of decision-making is paramount.
Conducting Blameless Post-Mortems
Blameless post-mortems are the most valuable output of any incident. They transform a negative event into systemic improvements that prevent recurrence. The blameless element is critical - if engineers fear blame, they will hide information, and the organisation loses the opportunity to learn.
Structure post-mortems around what happened (timeline), why it happened (root cause analysis), what the impact was (scope, duration, user impact), what went well in the response, what could have been better, and what actions will prevent recurrence. Publish post-mortems broadly - transparency about failures builds trust and helps other teams learn.
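The structure described above can be frozen into a template so every post-mortem covers the same ground. This is a hypothetical skeleton - the headings mirror the sections listed above, and the field names are assumptions to adapt to your organisation's conventions:

```python
# Hypothetical skeleton - adapt headings and fields to your own conventions.
POSTMORTEM_TEMPLATE = """\
# Post-Mortem: {title}
Severity: {severity} | Duration: {duration} | Date: {date}

## Timeline - what happened
## Root Cause - why it happened
## Impact - scope, duration, user impact
## What Went Well
## What Could Have Been Better
## Action Items - each with an owner and a deadline
"""

def new_postmortem(title: str, severity: str, duration: str, date: str) -> str:
    """Render a blank post-mortem document ready to be filled in."""
    return POSTMORTEM_TEMPLATE.format(
        title=title, severity=severity, duration=duration, date=date
    )
```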
Track post-mortem action items rigorously. The most common failure mode for post-mortems is generating action items that are never completed. Assign owners, set deadlines, and review progress in your regular team meetings. An incident that generates insights but no lasting changes is a missed opportunity.
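Rigorous tracking means knowing, at any moment, which action items are open and past deadline. A minimal sketch, assuming each item records an owner and a deadline as described above:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date
    done: bool = False

def items_to_escalate(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items past their deadline - review these in the regular team meeting."""
    return [i for i in items if not i.done and i.deadline < today]
```

Running this list through your regular team meeting turns "track rigorously" from an aspiration into a standing agenda item.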
Building Organisational Incident Readiness
Incident readiness is not a state you achieve but a capability you continuously build. Invest in observability so that incidents are detected automatically rather than reported by users. Invest in runbooks so that common incidents can be resolved quickly by any on-call engineer. Invest in automation so that recovery actions are fast and reliable.
Build a learning culture around incidents. Share post-mortems at engineering all-hands, maintain an incident knowledge base that is searchable, and incorporate incident learnings into new engineer onboarding. The organisation's collective incident response capability improves when learnings are shared broadly.
Track incident metrics to measure improvement: mean time to detect (MTTD), mean time to resolve (MTTR), incident frequency, and the percentage of incidents that fall into previously identified categories (your recurrence rate). These metrics should trend favourably over time - if they do not, your investment in prevention and readiness is insufficient.
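Computing these metrics from incident records is straightforward. A sketch, assuming each record carries start, detection, and resolution timestamps; note that MTTD here runs from start to detection and MTTR from start to resolution - definitions vary between teams, so pick one and apply it consistently:

```python
from statistics import mean

def incident_metrics(incidents: list[dict]) -> dict:
    """Compute MTTD and MTTR in minutes from incident records.

    Assumes each record has started_at, detected_at, and resolved_at
    datetime values (field names are illustrative assumptions).
    """
    minutes = lambda delta: delta.total_seconds() / 60
    return {
        "mttd_minutes": mean(minutes(i["detected_at"] - i["started_at"]) for i in incidents),
        "mttr_minutes": mean(minutes(i["resolved_at"] - i["started_at"]) for i in incidents),
        "count": len(incidents),
    }
```

Computed per quarter, these numbers give you the trend line the paragraph above asks for - and a concrete basis for deciding where the next readiness investment goes.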
Key Takeaways
- Establish a clear incident framework with severity levels, defined roles, and documented processes
- During incidents, focus on coordination, communication, and enabling your engineers rather than taking over
- Conduct blameless post-mortems and rigorously track action items to prevent recurrence
- Build incident readiness through observability, runbooks, and regular practice
- Track incident metrics over time and use data to guide investment in prevention and readiness
Frequently Asked Questions
- How do I create a blameless culture when leadership wants someone to blame?
- Educate leadership on why blame is counterproductive. Blaming individuals leads to information hiding, which prevents the organisation from identifying and fixing systemic issues. Present the alternative: blameless post-mortems that produce specific, actionable improvements. Share examples of how systemic fixes prevent entire categories of incidents. Over time, the visible improvement in reliability makes the case for blamelessness.
- How do I handle incidents that are caused by human error?
- Reframe human error as a systemic issue. If a human can make a mistake that causes a production incident, the system is not sufficiently protected against that mistake. Focus on building safeguards - automation that prevents dangerous actions, guardrails that catch mistakes before they reach production, and processes that require verification for high-risk changes. The question is not 'who made the mistake?' but 'why did the system allow this mistake to have this impact?'
- How many post-mortems should we write?
- Write post-mortems for every incident above a minimum severity threshold and for any incident that reveals a previously unknown systemic risk, regardless of severity. Do not write post-mortems for every minor alert or transient issue - this creates fatigue and dilutes the value of the process. The goal is to capture and act on the learnings from significant incidents, not to generate documents.
Download Incident Management Templates
Access our incident management templates including post-mortem frameworks, incident commander checklists, and severity level definitions for engineering teams.