MTTR: How to Measure & Improve Recovery Time

When production breaks, the clock starts. Every minute of downtime damages user trust, drains team energy, and pressures engineers into shortcuts. Mean time to recovery measures that clock, and teams that invest in observability, clear incident roles, and practised runbooks consistently stop it sooner than those relying on heroic individual effort.

Understanding Mean Time to Recovery

MTTR in the DORA context measures the elapsed time from when an incident begins to when service is fully restored. Measured this way, it covers detection time, diagnosis time, remediation time, and verification time. In practice the true start is often unknown and the clock starts at detection instead, in which case detection time falls outside the figure, so state which start point you use. A low MTTR indicates a team that can respond quickly and effectively to production issues.

MTTR is often confused with related metrics such as mean time between failures (MTBF) and mean time to detect (MTTD). MTBF measures the average time between incidents, whilst MTTD measures how quickly incidents are detected. MTTR encompasses the full recovery timeline from detection to resolution, making it the most comprehensive measure of incident response capability.

For engineering managers, MTTR is a crucial metric because it directly impacts customer experience and team well-being. Long recovery times mean extended outages for users and stressful, protracted incident responses for engineers. Reducing MTTR improves both user satisfaction and team quality of life.

How to Measure MTTR

MTTR is calculated by summing the duration of all incidents over a period and dividing by the number of incidents. Each incident's duration runs from when it is detected (or reported) to when service is confirmed restored. Use your incident management tool's timestamps for consistency and accuracy.
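As a minimal sketch of that calculation, assuming each incident record carries a detection timestamp and a confirmed-restoration timestamp. The `Incident` class and its field names are illustrative and not taken from any particular incident management tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape; map these to your tool's timestamp fields.
    detected_at: datetime
    restored_at: datetime

    @property
    def duration(self) -> timedelta:
        return self.restored_at - self.detected_at


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Sum of incident durations divided by the number of incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined for a period with no incidents")
    total = sum((i.duration for i in incidents), timedelta())
    return total / len(incidents)


# Illustrative data only, not real incidents.
incidents = [
    Incident(datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 30)),    # 30 min
    Incident(datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 15, 30)),  # 90 min
    Incident(datetime(2024, 3, 20, 2, 0), datetime(2024, 3, 20, 3, 0)),   # 60 min
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 1:00:00
```

Raising an error for an empty period, rather than returning zero, avoids reporting a misleading "perfect" MTTR for a month with no incidents.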

Because a few long incidents can dominate an arithmetic mean, report the median as your headline figure rather than the mean. Track both the median and the 90th percentile to understand typical recovery times and worst-case scenarios. Segment MTTR by severity level, as critical incidents and minor incidents have very different recovery profiles.

Measure from incident detection to confirmed service restoration
Use median values alongside 90th percentile for a complete picture
Segment by incident severity to understand recovery times at each level
Track MTTR trends monthly to identify improvements or regressions
Include all incidents, not just major outages, for comprehensive measurement
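The median and 90th-percentile view, segmented by severity, could be sketched as follows. This assumes durations are already available in minutes alongside a severity label, and it uses the nearest-rank percentile definition; other tools may interpolate and give slightly different values. The function names are illustrative:

```python
import math
from collections import defaultdict
from statistics import median


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarise_by_severity(records):
    """records: iterable of (severity, duration_minutes) pairs."""
    by_severity = defaultdict(list)
    for severity, minutes in records:
        by_severity[severity].append(minutes)
    return {
        severity: {
            "count": len(durations),
            "median": median(durations),
            "p90": percentile_nearest_rank(durations, 90),
        }
        for severity, durations in by_severity.items()
    }


# Illustrative data only: (severity, minutes from detection to restoration).
records = [
    ("critical", 45), ("critical", 60), ("critical", 480),
    ("minor", 10), ("minor", 15), ("minor", 20), ("minor", 25), ("minor", 240),
]

summary = summarise_by_severity(records)
for severity, stats in summary.items():
    print(severity, stats)
```

In this sample the critical incidents have a median of 60 minutes but an arithmetic mean of 195 minutes, because one eight-hour incident dominates the average. That gap is why the median and 90th percentile are reported together.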

MTTR Benchmarks

Elite performers recover from incidents in less than one hour. High performers recover within one day. Medium performers take between one day and one week, whilst low performers may take more than six months to recover from incidents. The gap between tiers is striking and underscores the importance of investing in incident response capabilities.
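To place a measured recovery time against these tiers, a small helper along the following lines may be useful. The function name is illustrative, and the handling of the range between one week and six months is an assumption, since the tiers above do not describe it:

```python
from datetime import timedelta


def recovery_tier(recovery_time: timedelta) -> str:
    """Map a recovery time onto the performance tiers described above."""
    if recovery_time < timedelta(hours=1):
        return "elite"
    if recovery_time <= timedelta(days=1):
        return "high"
    if recovery_time <= timedelta(weeks=1):
        return "medium"
    # Assumption: anything beyond one week is treated as "low", although the
    # benchmark text only characterises low performers as "more than six months".
    return "low"


print(recovery_tier(timedelta(minutes=40)))  # elite
print(recovery_tier(timedelta(hours=5)))     # high
print(recovery_tier(timedelta(days=3)))      # medium
```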

Sub-hour recovery times require a combination of excellent observability, well-practised incident response processes, and the technical ability to quickly deploy fixes or rollbacks. Teams at this level have automated runbooks, comprehensive monitoring, and clear escalation paths that minimise time spent on diagnosis and coordination.

If your team's MTTR is measured in days rather than hours, focus first on reducing detection time through better monitoring and alerting. Many teams spend the majority of their incident duration simply unaware that a problem exists. Proactive monitoring can dramatically reduce overall MTTR.
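One way to check whether detection is the bottleneck is to compute how much of each incident passed before anyone noticed. The sketch below assumes you can estimate when the incident actually started (often reconstructed during the post-incident review); the function and parameter names are illustrative:

```python
from datetime import datetime


def detection_share(started_at: datetime, detected_at: datetime, restored_at: datetime) -> float:
    """Fraction of the incident (start to restoration) that passed before detection."""
    total = (restored_at - started_at).total_seconds()
    if total <= 0:
        raise ValueError("restored_at must be later than started_at")
    undetected = (detected_at - started_at).total_seconds()
    return undetected / total


# Illustrative incident: started 10:00, detected 11:30, restored 12:00.
share = detection_share(
    datetime(2024, 5, 2, 10, 0),
    datetime(2024, 5, 2, 11, 30),
    datetime(2024, 5, 2, 12, 0),
)
print(f"{share:.0%} of the incident passed before detection")  # 75% ...
```

A consistently high share across incidents suggests that monitoring and alerting deserve attention before diagnosis or remediation tooling.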

Strategies to Reduce MTTR

Invest in observability: comprehensive logging, metrics, and distributed tracing that allow engineers to quickly identify the source and scope of an incident. Without good observability, engineers spend valuable time during incidents simply trying to understand what is happening. Tools like structured logging, application performance monitoring, and real-time dashboards are essential.
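As an illustration of structured logging, the sketch below uses only Python's standard `logging` and `json` modules to emit one JSON object per log line. The `JsonFormatter` class, the `context` key, the `checkout` logger name, and the field values are all hypothetical choices for this example, not a prescribed schema:

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} become attributes on the record.
        context = getattr(record, "context", None)
        if context:
            payload.update(context)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.error(
    "payment provider timeout",
    extra={"context": {"request_id": "req-123", "upstream": "payments", "latency_ms": 5012}},
)
```

Because every line is machine-parseable, an engineer can filter by `request_id` or `upstream` during an incident instead of searching free-text messages.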

Establish clear incident response procedures including defined roles (incident commander, communications lead, technical lead), communication channels, and escalation paths. When an incident occurs, engineers should know exactly what to do and who to contact. Regular incident response drills help teams practise these procedures under low-stress conditions.

Implement comprehensive observability with logging, metrics, and tracing
Define clear incident response roles and procedures
Maintain automated rollback capabilities for rapid recovery
Conduct regular incident response drills to build team readiness
Create and maintain runbooks for common failure scenarios
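An automated rollback can be as simple as polling a health check after each release and reverting on the first failure. The following is a sketch under stated assumptions: `deploy`, `is_healthy`, and `rollback` are placeholder callables that you would wire to your own deployment tooling, and the check count and interval are arbitrary defaults:

```python
import time
from typing import Callable


def deploy_with_auto_rollback(
    deploy: Callable[[], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
    checks: int = 5,
    interval_seconds: float = 10.0,
) -> bool:
    """Deploy, then poll a health check; roll back on the first failed check.

    Returns True if the release stayed healthy, False if it was rolled back.
    """
    deploy()
    for _ in range(checks):
        time.sleep(interval_seconds)
        if not is_healthy():
            rollback()
            return False
    return True


# Illustrative usage with stub callables standing in for real tooling.
events = []
health_results = iter([True, False])  # second check fails

stayed_up = deploy_with_auto_rollback(
    deploy=lambda: events.append("deploy"),
    is_healthy=lambda: next(health_results),
    rollback=lambda: events.append("rollback"),
    checks=3,
    interval_seconds=0,
)
print(stayed_up, events)  # False ['deploy', 'rollback']
```

Rolling back first and diagnosing afterwards keeps recovery time independent of how long the root cause takes to find.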

Learning from Incidents to Improve MTTR

Blameless post-mortems are the cornerstone of MTTR improvement. After every significant incident, conduct a structured review that examines the timeline, root cause, contributing factors, and potential improvements. Focus on systemic fixes rather than individual blame. Document the findings and track action items to completion.

Categorise incidents by root cause and failure mode to identify patterns. If database connection exhaustion causes recurring incidents, that is a systemic issue requiring architectural attention. If deployment-related incidents are common, your release process needs improvement. Pattern analysis turns individual incidents into strategic improvement opportunities.
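A rough sketch of this kind of pattern analysis, assuming each incident has already been tagged with a root-cause category during its post-mortem. The category names and figures are invented for illustration:

```python
from collections import Counter, defaultdict

# Illustrative records only: (root_cause_category, downtime_minutes).
incidents = [
    ("db-connection-exhaustion", 50),
    ("deployment", 20),
    ("db-connection-exhaustion", 75),
    ("expired-certificate", 120),
    ("deployment", 35),
    ("db-connection-exhaustion", 40),
]

counts = Counter(cause for cause, _ in incidents)
downtime = defaultdict(int)
for cause, minutes in incidents:
    downtime[cause] += minutes

# List categories from most to least frequent, with their cumulative downtime.
for cause, count in counts.most_common():
    print(f"{cause}: {count} incidents, {downtime[cause]} minutes of downtime")
```

Looking at both frequency and cumulative downtime matters: in this sample the single certificate expiry cost more downtime than both deployment incidents combined, so ranking by count alone would understate it.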

Share learnings widely across the engineering organisation. One team's incident can provide valuable lessons for other teams facing similar challenges. Regular incident review meetings, shared post-mortem repositories, and engineering-wide presentations all help spread knowledge and raise the collective resilience of your organisation.

Key Takeaways

MTTR measures the time from incident detection to confirmed service restoration
Elite performers recover in under one hour, whilst low performers may take more than six months
Observability investment is the fastest path to MTTR improvement through faster detection and diagnosis
Clear incident response procedures and regular drills reduce coordination overhead during incidents
Blameless post-mortems and pattern analysis drive systematic MTTR improvement over time

Frequently Asked Questions

What is the difference between MTTR and MTTD?

MTTD (mean time to detect) measures how long it takes to discover that an incident has occurred. MTTR (mean time to recovery) includes detection time plus the time to diagnose and resolve the issue. MTTD is a component of MTTR, and reducing detection time is often the quickest way to improve MTTR.

How do we improve MTTR without burning out our on-call engineers?

Focus on automation and preparation rather than heroic individual effort. Automated monitoring, clear runbooks, and well-defined escalation paths reduce the cognitive burden on on-call engineers. Also ensure on-call rotations are fair and sustainable, with adequate rest periods between shifts.

Should we measure MTTR for all incidents or just major ones?

Measure MTTR for all incidents but segment by severity. Overall MTTR gives you the big picture, whilst severity-specific MTTR helps you understand your response capability at each level. Critical incidents warrant the most detailed analysis and targeted improvement efforts.

