Engineering Incident Management: A Leader's Playbook

PagerDuty just woke up three engineers. The CEO is asking what happened. Your team is scrambling to find the root cause while stakeholders demand ETAs you cannot give. Your job right now is not to debug the code. It is to coordinate the response, shield the team from noise, and communicate clearly enough that everyone stays calm. Here is how to lead through the incident and build the culture that prevents the next one.

Leading the Incident Response

Your role during an incident is not to debug the code - it is to coordinate the response. Designate an incident commander if one is not already assigned, ensure the right engineers are engaged, and remove blockers so the team can focus on resolution. Resist the urge to hover over engineers' shoulders or demand constant status updates, as this adds pressure without adding value.

Establish a clear communication cadence early. Decide who is updating stakeholders, how frequently, and through which channel. A single source of truth - typically a dedicated Slack channel or incident management tool - prevents conflicting information from spreading. Keep business stakeholders informed with non-technical summaries while the engineering team works in their own channel.

Monitor the team's energy and stress levels. Incidents that drag on for hours can lead to fatigue-driven mistakes. Rotate engineers in and out, ensure people take breaks, and be prepared to call in additional support if the incident extends beyond a reasonable timeframe.

Communicating with Stakeholders During Incidents

Stakeholder communication during incidents has to balance transparency against unnecessary alarm. Provide factual updates on what is known, what is being done, and when the next update will arrive. Avoid speculation about root causes until you have enough information to be confident.

Tailor your communication to the audience. Executives want to know the business impact and estimated time to resolution. Product managers want to understand which features are affected. Customer support teams need talking points for affected users. Preparing these different perspectives in advance - as part of your incident communication template - saves time during the actual event.
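As an illustration of preparing these perspectives in advance, here is a minimal sketch of an incident communication template in Python. The `IncidentStatus` fields, the audience names, and the `render_update` function are all hypothetical; adapt them to whatever your own template and tooling actually capture.

```python
from dataclasses import dataclass


@dataclass
class IncidentStatus:
    """Facts known so far. Every field is a hypothetical example."""
    summary: str                  # non-technical description of the problem
    business_impact: str          # who or what is affected commercially
    affected_features: list[str]  # product areas currently degraded
    eta: str                      # estimated time to resolution, or "unknown"
    next_update: str              # when the next update will be sent


def render_update(status: IncidentStatus, audience: str) -> str:
    """Return an update tailored to 'executive', 'product', or 'support'."""
    if audience == "executive":
        body = (
            f"Business impact: {status.business_impact}\n"
            f"Estimated resolution: {status.eta}"
        )
    elif audience == "product":
        features = ", ".join(status.affected_features) or "none identified yet"
        body = f"Affected features: {features}"
    elif audience == "support":
        body = f"Talking point for affected users: {status.summary}"
    else:
        raise ValueError(f"unknown audience: {audience}")
    # Every audience is told when to expect the next update.
    return f"{body}\nNext update: {status.next_update}"
```

Filling in one `IncidentStatus` and rendering it three times keeps the facts consistent across audiences while changing only the emphasis.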

Running Blameless Post-Mortems

The post-mortem is where the real value of an incident emerges. Schedule it within a few days of resolution, while memories are fresh but emotions have had time to settle. The goal is to understand what happened, why it happened, and what systemic changes will prevent recurrence.

Blameless does not mean accountability-free. It means focusing on systems and processes rather than individual failures. Ask 'What made it possible for this to happen?' rather than 'Who made this mistake?' When individuals did make errors, examine what conditions - lack of documentation, inadequate testing, time pressure - contributed to the error.

Document actionable follow-up items with clear owners and deadlines. A post-mortem that identifies problems but generates no action items is a wasted exercise. Track these items to completion and review them in subsequent incidents to verify that the fixes were effective.
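One way to keep follow-up items visible is to record them as structured data rather than free text. The sketch below is only an assumption about how that might look: `ActionItem` and `overdue_items` are hypothetical names, and most teams would hold this data in their issue tracker rather than in a script.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """A post-mortem follow-up with a clear owner and deadline."""
    description: str
    owner: str
    due: date
    done: bool = False


def overdue_items(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return items that are still open and past their deadline."""
    return [item for item in items if not item.done and item.due < today]


# Illustrative data only.
items = [
    ActionItem("Add config validation to deploy pipeline", "alice", date(2024, 3, 1)),
    ActionItem("Document rollback procedure", "bob", date(2024, 3, 15), done=True),
]
for item in overdue_items(items, today=date(2024, 3, 10)):
    print(f"OVERDUE: {item.description} (owner: {item.owner}, due {item.due})")
```

Running a check like this before each post-mortem review makes it obvious which commitments from earlier incidents are still open.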

Building a Healthy Incident Culture

The way you respond to incidents shapes your team's willingness to surface problems early. If engineers fear punishment for causing outages, they will hide issues and avoid taking risks. If they see incidents handled calmly and post-mortems conducted fairly, they will be more proactive about raising concerns before they escalate.

Celebrate good incident response as well as incident prevention. Recognise engineers who communicated clearly during an outage, who identified the root cause quickly, or who wrote thorough post-mortems. This reinforces the behaviours you want to see.

Long-Term Incident Prevention Strategies

Use incident data to identify patterns. Are most incidents caused by deployment issues? Invest in deployment tooling and canary releases. Are they caused by configuration changes? Build validation and rollback mechanisms. Data-driven prioritisation ensures your prevention efforts target the highest-impact areas.
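A minimal sketch of this kind of analysis, assuming each incident record carries a cause category and a duration in minutes. The record shape and the `rank_causes` function are hypothetical, and duration is only one possible proxy for impact.

```python
from collections import defaultdict


def rank_causes(incidents: list[dict]) -> list[tuple[str, int, int]]:
    """Return (cause, incident_count, total_minutes), highest total minutes first."""
    counts: dict[str, int] = defaultdict(int)
    minutes: dict[str, int] = defaultdict(int)
    for incident in incidents:
        counts[incident["cause"]] += 1
        minutes[incident["cause"]] += incident["minutes"]
    return sorted(
        ((cause, counts[cause], minutes[cause]) for cause in counts),
        key=lambda row: row[2],
        reverse=True,
    )


# Illustrative data only.
incidents = [
    {"cause": "deployment", "minutes": 30},
    {"cause": "configuration", "minutes": 90},
    {"cause": "deployment", "minutes": 20},
]
for cause, count, total in rank_causes(incidents):
    print(f"{cause}: {count} incident(s), {total} minutes")
```

In this made-up data, configuration changes cause fewer incidents than deployments but more total downtime, which is the kind of distinction that a raw incident count would hide.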

Invest in observability and alerting. Teams that detect issues before customers report them resolve incidents faster and with less business impact. Ensure your monitoring covers not just system health but business-critical user journeys.
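To make the idea of monitoring a user journey concrete, here is a deliberately simple sketch. It assumes you already record whether each recent attempt at a journey (for example, a synthetic checkout) succeeded. The function name and the 95% default threshold are illustrative assumptions, not recommendations.

```python
def journey_is_degraded(outcomes: list[bool], min_success_rate: float = 0.95) -> bool:
    """Return True when the recent success rate of a user journey is too low.

    `outcomes` holds one boolean per recent attempt (True means success).
    An empty list is treated as degraded in this sketch, on the assumption
    that missing data is itself worth investigating.
    """
    if not outcomes:
        return True
    success_rate = sum(outcomes) / len(outcomes)
    return success_rate < min_success_rate


# Illustrative data only: nine successes and one failure is a 90% success rate.
recent_checkout_attempts = [True] * 9 + [False]
if journey_is_degraded(recent_checkout_attempts):
    print("Alert: checkout journey success rate is below threshold")
```

A check like this complements system-health metrics, because a service can report healthy CPU and memory while the journey that customers care about is failing.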

Run regular game days or chaos engineering exercises to test your team's incident response muscles. Practising in a controlled environment builds the reflexes and confidence needed when a real incident strikes.

Key Takeaways

Lead incident response by coordinating, communicating, and removing blockers - not by debugging
Establish a single source of truth and clear communication cadence for stakeholders
Run blameless post-mortems focused on systemic causes, with actionable follow-up items
Build an incident culture where surfacing problems early is rewarded, not punished
Use incident data to drive long-term prevention investments in tooling and observability

Frequently Asked Questions

Should engineering managers be on call?

It depends on your organisation's structure and team size. In smaller teams, managers may need to be in the on-call rotation. In larger organisations, the manager's role during incidents is typically coordination and communication rather than debugging. Regardless, you should be reachable during severe incidents and prepared to take on the incident commander role when needed.

How do I handle an incident caused by a junior engineer's mistake?

Treat it as a learning opportunity, not a disciplinary issue. In the post-mortem, focus on what systemic gaps allowed the mistake to reach production - missing code review, inadequate testing, or lack of guardrails. If a junior engineer can cause a major outage with a single change, the problem is your system, not the engineer.

How often should we review our incident response process?

Review your incident response process quarterly, or after any incident where the response itself was significantly flawed. Look for patterns across multiple post-mortems - if the same systemic issues keep appearing, your follow-up process needs strengthening. Annual tabletop exercises or game days also help identify gaps before real incidents expose them.
