MTTR (Mean Time to Recovery): Measuring Incident Response Effectiveness

Learn how to measure and improve MTTR as a DORA metric. Practical strategies for engineering managers to reduce recovery time and build resilient systems.

Last updated: 7 March 2026

Mean time to recovery (MTTR) measures how long it takes to restore service after an incident or outage. As a DORA stability metric, it reflects your team's ability to detect, diagnose, and resolve production issues quickly, which is critical for maintaining user trust and service reliability.

Understanding Mean Time to Recovery

MTTR in the DORA context measures the elapsed time from when an incident begins to when service is fully restored. In practice the start is usually taken as the moment degradation is detected or reported, because that is the first reliable timestamp. Where the true start can be reconstructed, the measurement covers detection time as well as diagnosis, remediation, and verification time. A low MTTR indicates a team that can respond quickly and effectively to production issues.

MTTR is often confused with related metrics such as mean time between failures (MTBF) and mean time to detect (MTTD). MTBF measures the average time between incidents, whilst MTTD measures how quickly incidents are detected. MTTR encompasses the full recovery timeline from detection to resolution, making it the most comprehensive measure of incident response capability.

For engineering managers, MTTR is a crucial metric because it directly impacts customer experience and team well-being. Long recovery times mean extended outages for users and stressful, protracted incident responses for engineers. Reducing MTTR improves both user satisfaction and team quality of life.

How to Measure MTTR

MTTR is calculated by summing the duration of all incidents over a period and dividing by the number of incidents. Each incident's duration runs from when it is detected (or reported) to when service is confirmed restored. Use your incident management tool's timestamps for consistency and accuracy.
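The calculation above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the incident records and the `mean_time_to_recovery` helper are hypothetical, and in practice the timestamps would come from your incident management tool.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, restored_at) pairs.
incidents = [
    (datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 5, 9, 45)),      # 45 minutes
    (datetime(2026, 1, 12, 14, 0), datetime(2026, 1, 12, 16, 0)),   # 120 minutes
    (datetime(2026, 1, 20, 22, 30), datetime(2026, 1, 20, 23, 0)),  # 30 minutes
]


def mean_time_to_recovery(records):
    """Sum of incident durations divided by the number of incidents."""
    if not records:
        raise ValueError("no incidents in the period")
    total = sum(
        (restored - detected for detected, restored in records),
        timedelta(),  # start the sum from a zero duration
    )
    return total / len(records)


mttr = mean_time_to_recovery(incidents)
print(mttr)  # 1:05:00
```

The three durations add up to 195 minutes, so the mean is 65 minutes.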

Although MTTR is a mean by name, a single major incident can skew the average heavily. As with other metrics, report the median as your headline figure to avoid that skew. Track both the median and the 90th percentile to understand typical recovery times and worst-case scenarios. Segment MTTR by severity level, as critical incidents and minor incidents have very different recovery profiles.

  • Measure from incident detection to confirmed service restoration
  • Use median values alongside 90th percentile for a complete picture
  • Segment by incident severity to understand recovery times at each level
  • Track MTTR trends monthly to identify improvements or regressions
  • Include all incidents, not just major outages, for comprehensive measurement
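As a rough sketch of the median, 90th percentile, and severity segmentation described above: the data, the severity labels, and the helper names below are all hypothetical, and the percentile uses the simple nearest-rank method, which is one of several reasonable definitions.

```python
import math
from collections import defaultdict
from statistics import median

# Hypothetical data: (severity, recovery time in minutes).
incidents = [
    ("critical", 42), ("critical", 95), ("critical", 310),
    ("minor", 12), ("minor", 18), ("minor", 25), ("minor", 240),
]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarise_by_severity(records):
    """Group recovery times by severity and summarise each group."""
    by_severity = defaultdict(list)
    for severity, minutes in records:
        by_severity[severity].append(minutes)
    return {
        severity: {
            "count": len(minutes),
            "median": median(minutes),
            "p90": percentile(minutes, 90),
        }
        for severity, minutes in by_severity.items()
    }


summary = summarise_by_severity(incidents)
for severity, stats in summary.items():
    print(severity, stats)
```

In this sample the single 240-minute minor incident pulls the mean for minor incidents up to 73.75 minutes, whilst the median stays at 21.5 minutes. That gap is the skew the median is meant to avoid.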

MTTR Benchmarks

In DORA's benchmarks, elite performers recover from incidents in less than one hour. High performers recover within one day. Medium performers take between one day and one week, whilst low performers may take more than six months to recover from incidents. The gap between tiers is striking and underscores the importance of investing in incident response capabilities.
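To place your own figure against these tiers, a small helper along the following lines may be enough. The `performance_tier` function is hypothetical. The benchmark text leaves a gap between one week and six months, so the sketch makes the assumption that anything slower than one week counts as "low".

```python
from datetime import timedelta


def performance_tier(recovery_time: timedelta) -> str:
    """Map a typical recovery time to the benchmark tiers quoted above."""
    if recovery_time < timedelta(hours=1):
        return "elite"
    if recovery_time <= timedelta(days=1):
        return "high"
    if recovery_time <= timedelta(weeks=1):
        return "medium"
    # Assumption: everything slower than one week is treated as "low".
    return "low"


print(performance_tier(timedelta(minutes=40)))  # elite
print(performance_tier(timedelta(hours=6)))     # high
print(performance_tier(timedelta(days=3)))      # medium
print(performance_tier(timedelta(days=30)))     # low
```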

Sub-hour recovery times require a combination of excellent observability, well-practised incident response processes, and the technical ability to quickly deploy fixes or rollbacks. Teams at this level have automated runbooks, comprehensive monitoring, and clear escalation paths that minimise time spent on diagnosis and coordination.

If your team's MTTR is measured in days rather than hours, focus first on reducing detection time through better monitoring and alerting. Many teams spend the majority of their incident duration simply unaware that a problem exists. Proactive monitoring can dramatically reduce overall MTTR.
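One way to check whether detection is your bottleneck is to compute the share of each incident that passed before anyone knew about it. The sketch below assumes you can reconstruct a `started_at` timestamp, for example during a post-mortem. The field names, sample data, and the `detection_share` helper are hypothetical.

```python
from datetime import datetime

# Hypothetical timeline per incident: (started_at, detected_at, restored_at).
timelines = [
    (datetime(2026, 2, 3, 8, 0), datetime(2026, 2, 3, 11, 0), datetime(2026, 2, 3, 12, 0)),
    (datetime(2026, 2, 10, 13, 0), datetime(2026, 2, 10, 13, 10), datetime(2026, 2, 10, 13, 50)),
]


def detection_share(started_at, detected_at, restored_at):
    """Fraction of the whole incident that passed before detection."""
    total = (restored_at - started_at).total_seconds()
    if total <= 0:
        raise ValueError("restored_at must be after started_at")
    return (detected_at - started_at).total_seconds() / total


shares = [detection_share(*timeline) for timeline in timelines]
for share in shares:
    print(f"{share:.0%} of the incident passed before detection")
```

The first incident went unnoticed for three of its four hours (75%), which points to monitoring and alerting. The second was detected after 10 of its 50 minutes (20%), so diagnosis and remediation are the better place to look.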

Strategies to Reduce MTTR

Invest in observability: comprehensive logging, metrics, and distributed tracing that allow engineers to quickly identify the source and scope of an incident. Without good observability, engineers spend valuable time during incidents simply trying to understand what is happening. Tools like structured logging, application performance monitoring, and real-time dashboards are essential.

Establish clear incident response procedures including defined roles (incident commander, communications lead, technical lead), communication channels, and escalation paths. When an incident occurs, engineers should know exactly what to do and who to contact. Regular incident response drills help teams practise these procedures under low-stress conditions.

  • Implement comprehensive observability with logging, metrics, and tracing
  • Define clear incident response roles and procedures
  • Maintain automated rollback capabilities for rapid recovery
  • Conduct regular incident response drills to build team readiness
  • Create and maintain runbooks for common failure scenarios

Learning from Incidents to Improve MTTR

Blameless post-mortems are the cornerstone of MTTR improvement. After every significant incident, conduct a structured review that examines the timeline, root cause, contributing factors, and potential improvements. Focus on systemic fixes rather than individual blame. Document the findings and track action items to completion.

Categorise incidents by root cause and failure mode to identify patterns. If database connection exhaustion causes recurring incidents, that is a systemic issue requiring architectural attention. If deployment-related incidents are common, your release process needs improvement. Pattern analysis turns individual incidents into strategic improvement opportunities.
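Pattern analysis can start as a simple tally. The sketch below assumes each post-mortem records a root-cause category. The category names, sample data, and the `recurring_causes` helper are hypothetical.

```python
from collections import Counter

# Hypothetical post-mortem records: (root_cause_category, recovery_minutes).
postmortems = [
    ("db_connection_exhaustion", 80),
    ("deployment", 35),
    ("db_connection_exhaustion", 120),
    ("expired_certificate", 60),
    ("deployment", 20),
    ("db_connection_exhaustion", 95),
]


def recurring_causes(records, min_count=2):
    """Return root causes seen at least min_count times, most frequent first."""
    counts = Counter(cause for cause, _ in records)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


patterns = recurring_causes(postmortems)
print(patterns)  # [('db_connection_exhaustion', 3), ('deployment', 2)]
```

A tally like this only surfaces candidates. Whether a recurring cause needs architectural attention or a release process change is still a judgement for the review meeting.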

Share learnings widely across the engineering organisation. One team's incident can provide valuable lessons for other teams facing similar challenges. Regular incident review meetings, shared post-mortem repositories, and engineering-wide presentations all help spread knowledge and raise the collective resilience of your organisation.

Key Takeaways

  • MTTR measures the time from incident detection to confirmed service restoration
  • Elite performers recover in under one hour, whilst low performers may take more than six months
  • Observability investment is the fastest path to MTTR improvement through faster detection and diagnosis
  • Clear incident response procedures and regular drills reduce coordination overhead during incidents
  • Blameless post-mortems and pattern analysis drive systematic MTTR improvement over time

Frequently Asked Questions

What is the difference between MTTR and MTTD?

MTTD (mean time to detect) measures how long it takes to discover that an incident has occurred. MTTR (mean time to recovery) covers the time to diagnose and resolve the issue and, when measured from the start of the incident rather than from detection, the detection time as well. In that case MTTD is a component of MTTR, and reducing detection time is often the quickest way to improve MTTR.

How do we improve MTTR without burning out our on-call engineers?

Focus on automation and preparation rather than heroic individual effort. Automated monitoring, clear runbooks, and well-defined escalation paths reduce the cognitive burden on on-call engineers. Also ensure on-call rotations are fair and sustainable, with adequate rest periods between shifts.

Should we measure MTTR for all incidents or just major ones?

Measure MTTR for all incidents but segment by severity. Overall MTTR gives you the big picture, whilst severity-specific MTTR helps you understand your response capability at each level. Critical incidents warrant the most detailed analysis and targeted improvement efforts.
