Incident Management: Respond and Learn Without Blame

You will never prevent every incident. What you can control is how fast your team responds, how calmly they communicate, and how much they learn afterwards. This guide covers the full loop - from building runbooks and on-call readiness in peacetime to coordinating response under pressure and running post-incident reviews that produce real improvements.

Your Role During an Incident

During an active incident, the engineering manager's role is coordination, not execution. You are not the person debugging the failing service or writing the hotfix - your engineers are. Your job is to ensure the right people are engaged, communication is flowing to stakeholders, and the team has everything they need to resolve the issue.

This means you are the incident commander or you are supporting whoever holds that role. You manage the communication channels, provide status updates to leadership and customer-facing teams, make decisions about escalation, and shield the responders from distractions. The worst thing an engineering manager can do during an incident is add to the chaos by asking for updates every five minutes or suggesting solutions without context.

Coordinate communication, do not debug - let your engineers focus
Provide regular status updates to stakeholders and leadership
Make escalation decisions based on impact and duration
Shield the responders from unnecessary interruptions

Building Incident Readiness

Incident readiness is built in peacetime, not during a crisis. Ensure your team has documented runbooks for common failure scenarios, a clear on-call rotation with defined escalation paths, and alerting that is calibrated to catch real problems without creating alert fatigue.

Run regular incident simulations or game days where your team practises responding to realistic failure scenarios. These exercises build muscle memory and reveal gaps in your runbooks, tooling, and communication processes. A team that has practised incident response handles real incidents with less stress and resolves them faster.

Define severity levels and response expectations clearly. Your team should know exactly what constitutes a P1 versus a P2 incident, who needs to be paged for each severity level, and what the expected response time is. Ambiguity in these definitions leads to under-response for serious incidents and over-response for minor ones.
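One way to remove that ambiguity is to write the definitions down as data, so nobody has to interpret them in the middle of an incident. The sketch below is a hypothetical illustration only: the severity descriptions, the roles paged, and the response times are placeholder values rather than recommendations from this guide, and the names `SeverityPolicy` and `who_to_page` are invented for the example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    description: str              # what qualifies an incident for this level
    page: tuple[str, ...]         # roles to page, in order
    respond_within: timedelta     # expected time to first response


# Placeholder values: replace with your organisation's own definitions.
SEVERITY_POLICIES = {
    "P1": SeverityPolicy(
        description="Customer-facing outage or data loss",
        page=("primary on-call", "secondary on-call", "engineering manager"),
        respond_within=timedelta(minutes=5),
    ),
    "P2": SeverityPolicy(
        description="Degraded service with a workaround available",
        page=("primary on-call",),
        respond_within=timedelta(minutes=30),
    ),
}


def who_to_page(severity: str) -> tuple[str, ...]:
    """Return the roles to page for a severity, failing loudly on undefined levels."""
    try:
        return SEVERITY_POLICIES[severity].page
    except KeyError:
        raise ValueError(f"Undefined severity level: {severity!r}") from None
```

Raising an error for an undefined level is a deliberate choice in this sketch: a severity nobody has defined is exactly the kind of ambiguity the definitions are meant to eliminate.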

Running Effective Post-Incident Reviews

Post-incident reviews - sometimes called retrospectives or post-mortems - are where the real value of incident management lies. Every significant incident should be followed by a structured review within a week of resolution. The purpose is not to assign blame but to understand what happened, why, and what can be done to prevent similar incidents in the future.

A strong post-incident review follows a blameless methodology. It focuses on system failures and process gaps rather than individual mistakes. When people feel safe admitting errors, you get more accurate information and better action items. When people fear blame, they hide information and the organisation fails to learn.

Ensure that action items from post-incident reviews are tracked and completed. The most common failure in incident management is conducting thorough reviews but never following through on the improvements. Assign owners and deadlines to every action item, and review progress in your regular team meetings.
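The follow-through check described above can be as simple as a script run before each team meeting. This is a minimal sketch under stated assumptions: the `ActionItem` fields and the `needs_attention` helper are hypothetical, and in practice the items would come from whatever issue tracker your team already uses rather than from a hard-coded list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str]   # None means nobody has been assigned
    due: Optional[date]    # None means no deadline has been set
    done: bool = False


def needs_attention(items, today):
    """Return (item, reason) pairs for open items that are unowned, undated, or overdue."""
    flagged = []
    for item in items:
        if item.done:
            continue
        if item.owner is None:
            flagged.append((item, "no owner"))
        elif item.due is None:
            flagged.append((item, "no deadline"))
        elif item.due < today:
            flagged.append((item, f"overdue by {(today - item.due).days} days"))
    return flagged


# Hypothetical example data for illustration.
items = [
    ActionItem("Add alert on queue depth", "alice", date(2024, 3, 1)),
    ActionItem("Update failover runbook", None, date(2024, 4, 1)),
    ActionItem("Add retry limit to client", "bob", date(2024, 2, 1), done=True),
]

for item, reason in needs_attention(items, today=date(2024, 3, 15)):
    print(f"{item.title}: {reason}")
```

Items that stay on this list meeting after meeting are the ones worth escalating.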

Building a Culture of Reliability

Reliability is a cultural value, not just a technical one. As an engineering manager, you shape this culture through the decisions you make and the behaviours you reward. When you prioritise production stability alongside feature delivery, your team learns that reliability matters. When you consistently defer reliability work in favour of features, they learn the opposite.

Celebrate reliability wins as visibly as you celebrate feature launches. When an engineer identifies a potential failure mode and fixes it before it causes an incident, that deserves the same recognition as shipping a new feature. This reinforcement builds the proactive mindset that prevents incidents in the first place.

Common Incident Management Mistakes

The most damaging mistake is treating incidents as someone else's problem. If your team builds and operates a service, incident management is your responsibility. Delegating it entirely to an operations or SRE team creates a disconnect between those who build the software and those who deal with its failures.

Another common error is conducting post-incident reviews that devolve into blame sessions. Once your team learns that admitting mistakes leads to punishment, they will stop surfacing problems and your incident reviews will become performative exercises that produce no real improvement.

Finally, many engineering managers fail to invest in incident tooling and processes until after a major outage. By then, the organisation is in crisis mode and the investment feels like damage control rather than proactive improvement. Build your incident management capability before you need it.

Key Takeaways

Your role during incidents is coordination and communication, not debugging
Build incident readiness through runbooks, on-call rotations, and regular simulations
Run blameless post-incident reviews and track action items to completion
Celebrate reliability wins to build a proactive reliability culture
Invest in incident management tooling and processes before a major outage forces you to

Frequently Asked Questions

Should engineering managers be on the on-call rotation?

Generally no, but you should be available as an escalation point. Your engineers should handle the technical response, while you handle coordination and communication when incidents escalate beyond routine resolution. Being on the primary on-call rotation can actually slow incident response because you may lack the hands-on debugging skills that your engineers use daily. However, you should absolutely be reachable during major incidents.

How do I balance incident response with planned work?

Build incident response time into your capacity planning. If your team typically spends ten to fifteen per cent of its time on incident response and operational tasks, account for that in your roadmap. When major incidents require significant follow-up work, adjust your planned commitments and communicate the impact to stakeholders. Pretending incidents do not affect your delivery timeline leads to missed commitments and eroded trust.

How do I prevent post-incident action items from being deprioritised?

Treat post-incident action items as first-class work items with the same visibility and tracking as feature work. Include them in your sprint planning, assign clear owners and deadlines, and review their status in your regular team meetings. If an action item keeps getting deprioritised, escalate the risk to your leadership - the organisation needs to understand that deferring incident prevention is accepting the risk of recurrence.
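To make the capacity-planning answer above concrete, the arithmetic can be sketched as follows. The team size and sprint length are hypothetical; only the ten to fifteen per cent range comes from the text, and `planned_capacity` is a name invented for this example.

```python
def planned_capacity(engineers: int, days_per_sprint: int, operational_share: float) -> float:
    """Engineer-days left for roadmap work after reserving time for operational load."""
    if not 0.0 <= operational_share < 1.0:
        raise ValueError("operational_share must be at least 0 and below 1")
    total_days = engineers * days_per_sprint
    return total_days * (1.0 - operational_share)


# Hypothetical team: 6 engineers, 10 working days per sprint (60 engineer-days).
for share in (0.10, 0.15):
    available = planned_capacity(6, 10, share)
    print(f"{share:.0%} operational load -> {available:.1f} engineer-days for planned work")
```

For this hypothetical team, the reserve amounts to six to nine engineer-days per sprint, which is large enough to cause missed commitments if the roadmap ignores it.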

Get Incident Response Templates

Post-incident review templates, severity classification guides, and on-call rotation frameworks to strengthen your team's response capability.
