Incident Rate: How to Measure & Reduce Outages

Individual incidents feel random. Aggregate them by root cause and a pattern almost always emerges: database connection exhaustion, deployment-related regressions, or a single under-monitored service. Incident rate is most powerful not as a scorecard but as the starting point for root cause analysis that turns reactive firefighting into proactive prevention.

What Is Incident Rate?

Incident rate is the number of production incidents that occur over a defined period, typically measured weekly or monthly. Incidents are unplanned events that disrupt or degrade service for users. This metric provides a high-level view of your system's operational stability and your team's ability to prevent production issues.
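As a minimal sketch of the definition above, the snippet below counts incidents per calendar month. The function name and the sample dates are hypothetical and exist only for illustration; in practice the dates would come from your incident tracker.

```python
from collections import Counter
from datetime import date


def monthly_incident_counts(start_dates):
    """Count incidents per calendar month, keyed by 'YYYY-MM'."""
    return Counter(d.strftime("%Y-%m") for d in start_dates)


# Hypothetical incident start dates, for illustration only.
incident_dates = [
    date(2024, 1, 3), date(2024, 1, 17), date(2024, 1, 29),
    date(2024, 2, 8), date(2024, 2, 21),
]

counts = monthly_incident_counts(incident_dates)
for month, count in sorted(counts.items()):
    print(month, count)
```

The same grouping works for weekly tracking if the key is switched to an ISO week instead of a month.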

Incident rate is related to but distinct from change failure rate. Change failure rate measures the percentage of deployments that cause incidents, whilst incident rate captures all incidents regardless of cause, including infrastructure failures, third-party outages, and capacity issues. Together, they provide a comprehensive picture of operational reliability.
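The difference between the two metrics can be illustrated with a small sketch. The record layout, identifiers, and deployment total below are assumptions made up for the example: incident count includes every incident, while change failure rate only looks at the share of deployments linked to at least one incident.

```python
# Hypothetical incident records for one month. "deployment_id" is None
# when the incident was not caused by a deployment.
incidents = [
    {"id": "INC-1", "cause": "deployment", "deployment_id": "d-101"},
    {"id": "INC-2", "cause": "third_party", "deployment_id": None},
    {"id": "INC-3", "cause": "deployment", "deployment_id": "d-101"},
    {"id": "INC-4", "cause": "capacity", "deployment_id": None},
]
total_deployments = 40  # assumed deployment count for the same month

# Incident rate: all incidents in the period, regardless of cause.
incident_count = len(incidents)

# Change failure rate: distinct deployments that caused an incident,
# divided by all deployments in the period.
failed_deployments = {
    i["deployment_id"] for i in incidents if i["deployment_id"] is not None
}
change_failure_rate = len(failed_deployments) / total_deployments

print(incident_count, change_failure_rate)
```

Here two incidents trace back to the same deployment, so they add two to the incident count but only one failed deployment to the change failure rate.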

For meaningful measurement, you need a clear and consistently applied definition of what constitutes an incident. Without one, incident rate becomes unreliable, because each team member decides whether to report an issue based on subjective judgement. Establish severity levels and clear criteria for each.

How to Measure Incident Rate

Track incidents through a dedicated incident management tool such as PagerDuty, Opsgenie, or a simple issue tracker with an incident-specific workflow. Every incident should be logged with its start time, end time, severity, root cause category, and affected services. This data enables trend analysis and targeted improvements.
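One way to picture the record described above is a small data class. This is a sketch rather than the schema of any particular tool; the field names and the sample values are assumptions chosen to mirror the fields listed in the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Incident:
    """A single logged incident (hypothetical schema)."""
    started_at: datetime
    ended_at: datetime
    severity: str                # e.g. "P0" to "P4"
    root_cause_category: str     # e.g. "deployment", "capacity"
    affected_services: list[str] = field(default_factory=list)

    @property
    def duration(self) -> timedelta:
        """Time between the start and end of the incident."""
        return self.ended_at - self.started_at


# Illustrative record only; the values are made up.
example = Incident(
    started_at=datetime(2024, 3, 1, 10, 0),
    ended_at=datetime(2024, 3, 1, 10, 45),
    severity="P1",
    root_cause_category="deployment",
    affected_services=["checkout"],
)
print(example.duration)
```

Keeping start and end times as separate fields, rather than storing only a duration, leaves room to analyse when incidents occur as well as how long they last.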

Normalise incident rate by team size, number of services managed, or deployment volume for fairer comparisons. All else being equal, a team managing 20 microservices will tend to log more incidents than a team managing 3. Normalisation helps you understand whether incident rate changes reflect genuine reliability improvements or simply changes in scope.

Log every incident with severity, duration, root cause, and affected services
Use consistent severity definitions across all teams
Track incident rate weekly for operational awareness and monthly for trend analysis
Normalise by number of services or deployments for meaningful comparisons
Distinguish between incidents caused by your changes and those caused by external factors
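The normalisation described above can be sketched as a simple division. The helper name and the incident counts are hypothetical; the service counts reuse the 20-versus-3 example from the text.

```python
def incidents_per_unit(incident_count, units):
    """Normalise an incident count by services, deployments, or team size."""
    if units <= 0:
        raise ValueError("units must be positive")
    return incident_count / units


# Hypothetical monthly figures for two teams.
team_a = incidents_per_unit(incident_count=10, units=20)  # 20 services
team_b = incidents_per_unit(incident_count=3, units=3)    # 3 services

print(team_a, team_b)
```

In this made-up case the larger team has more incidents in absolute terms but half as many per service, which is the comparison the raw count hides.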

Setting Incident Rate Targets

There are no universal incident rate benchmarks because incident definitions and service complexity vary enormously across organisations. Instead, focus on your own trend. A team that reduces its monthly incident count from 15 to 8 over six months has made significant progress regardless of what other teams experience.

Consider setting targets by severity level. You might aim for zero critical (P0) incidents and fewer than two high (P1) incidents per month, while placing less emphasis on minor (P3/P4) incidents that have minimal user impact. This severity-tiered approach ensures you prioritise the incidents that matter most.
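A severity-tiered target can be checked mechanically. The sketch below assumes the example limits from the paragraph above (no P0 incidents, at most one P1 per month); the names and the sample counts are hypothetical.

```python
# Maximum allowed incidents per month for each tracked severity.
# "Fewer than two" P1 incidents means at most one.
MAX_PER_MONTH = {"P0": 0, "P1": 1}


def breached_targets(counts_by_severity, limits=MAX_PER_MONTH):
    """Return the severities whose monthly count exceeds its limit."""
    return sorted(
        severity
        for severity, limit in limits.items()
        if counts_by_severity.get(severity, 0) > limit
    )


# Hypothetical month: P3 incidents are counted but carry no target.
march = {"P0": 0, "P1": 2, "P3": 7}
print(breached_targets(march))
```

Severities without an entry in the limits, such as P3 here, are deliberately ignored, which matches the idea of focusing on the incidents with the most user impact.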

Track the ratio of incidents to deployments as a complementary metric. If your incident count remains stable whilst deployment frequency increases, your team is becoming more reliable on a per-deployment basis. This nuance is important when communicating with stakeholders.
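To make the per-deployment view concrete, the sketch below uses invented monthly figures in which the incident count stays flat while deployments double.

```python
# Hypothetical (month, incidents, deployments) figures.
monthly = [
    ("2024-01", 8, 40),
    ("2024-02", 8, 60),
    ("2024-03", 8, 80),
]

# Incidents per deployment for each month.
ratios = {
    month: incidents / deployments
    for month, incidents, deployments in monthly
}

for month, ratio in ratios.items():
    print(f"{month}: {ratio:.3f} incidents per deployment")
```

With these numbers the ratio falls from 0.2 to 0.1 even though the monthly incident count never changes.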

Strategies to Reduce Incident Rate

Root cause analysis is the foundation of incident rate reduction. Categorise incidents by root cause: deployment failures, infrastructure issues, capacity problems, third-party outages, configuration errors, and so on. Identify the categories responsible for the most incidents and invest in systematic prevention for each.
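Ranking categories by frequency is a simple way to find where to invest first. The sketch below uses Python's `collections.Counter`; the category labels and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical root cause category for each incident in a quarter.
root_causes = [
    "deployment", "capacity", "deployment", "third_party",
    "configuration", "deployment", "capacity",
]

# Categories ordered from most to least frequent.
ranked = Counter(root_causes).most_common()

total = len(root_causes)
for category, count in ranked:
    print(f"{category}: {count} ({count / total:.0%})")
```

A breakdown like this is only as good as the categorisation behind it, so it helps to keep the category list short and to apply it consistently.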

Proactive monitoring and alerting catch issues before they become user-facing incidents. Implement health checks, canary monitoring, and anomaly detection to identify problems early. Many potential incidents can be prevented through automated scaling, circuit breakers, and self-healing systems that respond to early warning signals.

Conduct root cause analysis for every incident and categorise by failure mode
Invest in proactive monitoring to catch issues before they impact users
Implement automated scaling and circuit breakers for common failure scenarios
Run chaos engineering experiments to discover weaknesses before they cause incidents
Address the top root cause categories systematically through quarterly reliability initiatives
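As an illustration of the circuit breaker idea mentioned above, here is a deliberately minimal sketch. It is not taken from any library: the class, its parameters, and the thresholds are assumptions, and a production setup would more likely rely on an established resilience library or a service mesh feature.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down has passed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if the trial failed.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        # A successful call closes the circuit and resets the count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request to a dependency in `breaker.call(...)` and fall back to a cached or degraded response when `CircuitOpenError` is raised. The injectable `clock` parameter is there so the cool-down can be tested without waiting.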

Building an Incident-Aware Culture

A healthy incident culture encourages reporting, not suppressing, incidents. If engineers fear blame for incidents, they will avoid reporting them, and your incident rate data will underrepresent reality. Blameless post-mortems, fair on-call practices, and leadership support for reliability investments all contribute to a culture where incidents are learning opportunities.

Regular incident reviews, where the team examines recent incidents and post-mortem action items, keep reliability top of mind. These reviews should be brief (30 minutes) and focused on patterns and systemic improvements rather than rehashing individual incidents. Monthly incident reviews work well for most teams.

Celebrate reliability improvements alongside feature delivery. When your team reduces incident rate by 50%, that achievement deserves the same recognition as launching a major feature. Making reliability visible and valued ensures it receives sustained attention and investment.

Key Takeaways

Incident rate measures the frequency of production incidents and reflects overall system reliability
Define clear, consistent incident severity levels and reporting criteria
Focus on your own trend over time rather than external benchmarks
Root cause analysis and proactive monitoring are the most effective reduction strategies
Build a blameless culture where incident reporting is encouraged and reliability is celebrated

Frequently Asked Questions

How do we distinguish between incidents and service requests?

An incident is an unplanned event that disrupts or degrades service for users. A service request is a planned, expected action (like provisioning access). Define clear criteria based on user impact. If users experience degraded service, it is an incident regardless of the cause.

Should we count incidents caused by third-party outages?

Yes, but categorise them separately. Third-party incidents affect your users even if you did not cause them. Tracking them helps you evaluate vendor reliability and invest in resilience measures like fallbacks and graceful degradation for critical third-party dependencies.

What is an acceptable incident rate for a growing system?

As systems grow in complexity, some increase in absolute incident count is expected. Focus on incident rate per service or per deployment. If this normalised rate stays stable or decreases as you scale, your reliability practices are keeping pace with your growth.

Get Your Incident Tracking Dashboard

Our templates include severity definitions, root cause taxonomies, and tracking dashboards that surface the patterns behind your incidents.
