Production incidents test an engineering manager's leadership like few other situations. When systems go down and pressure mounts, your team looks to you for calm direction and clear decision-making. This guide covers how to lead effectively during an incident, communicate with stakeholders, and build a culture that learns from failure rather than assigning blame.
Leading the Incident Response
Your role during an incident is not to debug the code - it is to coordinate the response. Designate an incident commander if one is not already assigned, ensure the right engineers are engaged, and remove blockers so the team can focus on resolution. Resist the urge to hover over engineers' shoulders or demand constant status updates, as this adds pressure without adding value.
Establish a clear communication cadence early. Decide who is updating stakeholders, how frequently, and through which channel. A single source of truth - typically a dedicated Slack channel or incident management tool - prevents conflicting information from spreading. Keep business stakeholders informed with non-technical summaries while the engineering team works in their own channel.
Monitor the team's energy and stress levels. Incidents that drag on for hours can lead to fatigue-driven mistakes. Rotate engineers in and out, ensure people take breaks, and be prepared to call in additional support if the incident extends beyond a reasonable timeframe.
Communicating with Stakeholders During Incidents
Stakeholder communication during incidents requires a balance between transparency and avoiding unnecessary alarm. Provide factual updates on what is known, what is being done, and when the next update will arrive. Avoid speculation about root causes until you have enough information to be confident.
Tailor your communication to the audience. Executives want to know the business impact and estimated time to resolution. Product managers want to understand which features are affected. Customer support teams need talking points for affected users. Preparing these different perspectives in advance - as part of your incident communication template - saves time during the actual event.
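As a rough illustration of the update structure described in this section (what is known, what is being done, when the next update will arrive), here is a minimal Python sketch. The `StatusUpdate` class, its field names, and the fixed cadence are assumptions made for this example. They are not part of any particular incident management tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StatusUpdate:
    """Hypothetical container for one stakeholder update."""
    known: str            # confirmed facts only, no root-cause speculation
    action: str           # what the team is doing right now
    sent_at: datetime     # when this update is sent
    cadence_minutes: int  # agreed interval between updates (assumed fixed)

    def next_update(self) -> datetime:
        # The next update is promised one cadence interval after this one.
        return self.sent_at + timedelta(minutes=self.cadence_minutes)

    def render(self) -> str:
        # Three lines, matching the three facts stakeholders need.
        return (
            f"What we know: {self.known}\n"
            f"What we are doing: {self.action}\n"
            f"Next update: {self.next_update():%H:%M}"
        )
```

Calling `render()` on an update gives a short message you could paste into the stakeholder channel. Audience-specific versions for executives, product managers, and customer support could reuse the same structure with different wording in `known` and `action`.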
Running Blameless Post-Mortems
The post-mortem is where the real value of an incident emerges. Schedule it within a few days of resolution, while memories are fresh but not so soon that emotions are still running high. The goal is to understand what happened, why it happened, and what systemic changes will prevent recurrence.
Blameless does not mean accountability-free. It means focusing on systems and processes rather than individual failures. Ask 'What made it possible for this to happen?' rather than 'Who made this mistake?' When individuals did make errors, examine what conditions - lack of documentation, inadequate testing, time pressure - contributed to the error.
Document actionable follow-up items with clear owners and deadlines. A post-mortem that identifies problems but generates no action items is a wasted exercise. Track these items to completion, and revisit them in later post-mortems to verify that the fixes were effective.
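One lightweight way to track follow-up items is to record each one with an owner and a due date, then regularly list the ones that have slipped. The sketch below assumes a hypothetical `ActionItem` record and an `overdue` helper. In practice this data would more likely live in your issue tracker.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical post-mortem follow-up item."""
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    # An item is overdue if it is still open and its due date has passed.
    return [item for item in items if not item.done and item.due < today]
```

Reviewing the output of `overdue` in a weekly team meeting is one possible way to keep post-mortem actions from quietly stalling.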
Building a Healthy Incident Culture
The way you respond to incidents shapes your team's willingness to surface problems early. If engineers fear punishment for causing outages, they will hide issues and avoid taking risks. If they see incidents handled calmly and post-mortems conducted fairly, they will be more proactive about raising concerns before they escalate.
Celebrate good incident response as well as incident prevention. Recognise engineers who communicated clearly during an outage, who identified the root cause quickly, or who wrote thorough post-mortems. This reinforces the behaviours you want to see.
Long-Term Incident Prevention Strategies
Use incident data to identify patterns. Are most incidents caused by deployment issues? Invest in deployment tooling and canary releases. Are they caused by configuration changes? Build validation and rollback mechanisms. Data-driven prioritisation ensures your prevention efforts target the highest-impact areas.
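To make the pattern analysis concrete, the following Python sketch totals downtime by cause category and ranks the categories by impact. The record format (`cause` and `minutes_down` keys) and the category names are assumptions for illustration. Adapt them to whatever your post-mortem records actually capture.

```python
from collections import defaultdict


def impact_by_cause(incidents: list[dict]) -> list[tuple[str, int]]:
    """Rank cause categories by total minutes of downtime, highest first."""
    totals: dict[str, int] = defaultdict(int)
    for incident in incidents:
        # Each record is assumed to carry a cause label and a downtime figure.
        totals[incident["cause"]] += incident["minutes_down"]
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
```

Ranking by downtime rather than by incident count is a design choice: a category with few but long outages may deserve investment ahead of one with many brief ones. Other weightings, such as affected users or lost revenue, would follow the same shape.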
Invest in observability and alerting. Teams that detect issues before customers report them resolve incidents faster and with less business impact. Ensure your monitoring covers not just system health but business-critical user journeys.
Run regular game days or chaos engineering exercises to test your team's incident response muscles. Practising in a controlled environment builds the reflexes and confidence needed when a real incident strikes.
Key Takeaways
- Lead incident response by coordinating, communicating, and removing blockers - not by debugging
- Establish a single source of truth and clear communication cadence for stakeholders
- Run blameless post-mortems focused on systemic causes, with actionable follow-up items
- Build an incident culture where surfacing problems early is rewarded, not punished
- Use incident data to drive long-term prevention investments in tooling and observability
Frequently Asked Questions
- Should engineering managers be on call?
  It depends on your organisation's structure and team size. In smaller teams, managers may need to be in the on-call rotation. In larger organisations, the manager's role during incidents is typically coordination and communication rather than debugging. Regardless, you should be reachable during severe incidents and prepared to take on the incident commander role when needed.
- How do I handle an incident caused by a junior engineer's mistake?
  Treat it as a learning opportunity, not a disciplinary issue. In the post-mortem, focus on what systemic gaps allowed the mistake to reach production - missing code review, inadequate testing, or lack of guardrails. If a junior engineer can cause a major outage with a single change, the problem is your system, not the engineer.
- How often should we review our incident response process?
  Review your incident response process quarterly, or after any incident where the response itself was significantly flawed. Look for patterns across multiple post-mortems - if the same systemic issues keep appearing, your follow-up process needs strengthening. Annual tabletop exercises or game days also help identify gaps before real incidents expose them.