
Mean Time to Detect (MTTD): Catching Production Issues Faster

Learn how to measure and improve mean time to detect production issues. Covers monitoring strategies, benchmarks, and alerting best practices for engineering teams.

Last updated: 7 March 2026

Mean time to detect (MTTD) measures how long it takes to identify that a production issue has occurred. A short MTTD is the foundation of effective incident response: you cannot fix what you do not know about, and every minute of delayed detection extends the impact on your users.

What Is Mean Time to Detect?

Mean time to detect is the average elapsed time between when a production issue begins and when your team becomes aware of it. This includes the time for monitoring systems to register an anomaly, alerting thresholds to be breached, and the notification to reach a human who can begin investigation. MTTD is the first phase of the broader incident lifecycle that also includes mean time to acknowledge, mean time to resolve, and mean time to recovery.

MTTD is important because it directly affects the total duration and severity of incidents. An issue that takes five minutes to detect and thirty minutes to fix has a total impact of thirty-five minutes. The same issue with a sixty-minute detection delay has a total impact of ninety minutes, more than double, even though the fix time is identical. Improving detection time often yields greater reductions in total incident duration than improving fix time.

Detection can be proactive (your monitoring catches the issue before users notice) or reactive (users report the issue before your systems detect it). Tracking the ratio of proactive to reactive detection is itself a valuable metric. High-performing teams detect more than ninety percent of issues proactively through monitoring and alerting.

How to Measure Mean Time to Detect

To measure MTTD, record the timestamp when each incident began (often determined retrospectively during post-mortem analysis) and the timestamp when the team first became aware of the issue (the first alert, page, or user report). The difference between these two timestamps is the detection time for that incident. Average this across all incidents to calculate MTTD.

Accurate MTTD measurement requires good incident record-keeping. Your incident management tool should capture timestamps for key events in the incident lifecycle. During post-mortems, reconstruct the incident timeline to identify when the issue actually started versus when it was detected. This retrospective analysis often reveals that issues began well before they triggered alerts.

  • Record incident start time and detection time for every production issue
  • Calculate MTTD as the average detection time across all incidents
  • Track MTTD separately for different severity levels and services
  • Measure the ratio of proactive (monitoring-detected) to reactive (user-reported) incidents
  • Use post-mortem timelines to improve the accuracy of incident start time estimates
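The measurement steps above can be sketched in a few lines of Python. The incident record fields here (`started`, `detected`, `source`) are illustrative assumptions, not a real tool's schema; in practice these timestamps would come from your incident management system.

```python
from datetime import datetime

# Illustrative incident records: when the issue began (often established
# retrospectively in the post-mortem) and when the team first became aware.
incidents = [
    {"started": datetime(2026, 3, 1, 9, 0),  "detected": datetime(2026, 3, 1, 9, 4),  "source": "monitoring"},
    {"started": datetime(2026, 3, 2, 14, 0), "detected": datetime(2026, 3, 2, 14, 12), "source": "monitoring"},
    {"started": datetime(2026, 3, 3, 22, 0), "detected": datetime(2026, 3, 3, 22, 45), "source": "user_report"},
]

def mttd_minutes(records):
    """Average detection time in minutes across all incidents."""
    times = [(r["detected"] - r["started"]).total_seconds() / 60 for r in records]
    return sum(times) / len(times)

def proactive_ratio(records):
    """Fraction of incidents first detected by monitoring rather than users."""
    proactive = sum(1 for r in records if r["source"] == "monitoring")
    return proactive / len(records)

print(f"MTTD: {mttd_minutes(incidents):.1f} minutes")
print(f"Proactive detection: {proactive_ratio(incidents):.0%}")
```

Tracking the proactive ratio alongside MTTD, as in this sketch, makes monitoring gaps visible: a falling ratio means users are finding issues before your alerts do.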

MTTD Benchmarks for Engineering Teams

High-performing engineering organisations target MTTD of under five minutes for critical issues and under fifteen minutes for major issues. These targets require comprehensive monitoring, well-tuned alerting thresholds, and reliable notification systems. Achieving sub-five-minute MTTD means your monitoring must detect anomalies within minutes and your alerting must notify the right people immediately.

The industry median for MTTD varies widely by organisation maturity. Organisations with mature observability practices typically achieve MTTD of five to fifteen minutes. Less mature organisations may take thirty minutes to several hours to detect issues, often relying on user reports rather than proactive monitoring.

Track the percentage of incidents detected proactively by monitoring versus reactively by user reports. High-performing teams detect ninety percent or more of incidents proactively. If more than thirty percent of your incidents are first reported by users, your monitoring coverage has significant gaps that need to be addressed.

Strategies for Improving MTTD

Invest in comprehensive monitoring across the four pillars of observability: metrics, logs, traces, and events. Metrics provide real-time numerical data about system health. Logs capture detailed event records for investigation. Traces follow individual requests through distributed systems. Events track deployments, configuration changes, and other significant occurrences.

Tune your alerting thresholds to balance sensitivity and specificity. Alerts that are too sensitive generate false positives that lead to alert fatigue. Alerts that are too conservative miss genuine issues. Use anomaly detection and baseline comparisons rather than static thresholds where possible, and regularly review and adjust your alerting rules based on incident data.

  • Implement monitoring across metrics, logs, traces, and events
  • Use anomaly detection rather than static thresholds for alerting where possible
  • Set up synthetic monitoring to detect issues from the user's perspective
  • Regularly review and tune alerting rules to reduce false positives
  • Implement health checks and canary endpoints for critical services
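One common way to implement baseline comparison rather than a static threshold is a rolling z-score: flag a sample only when it deviates from recent history by several standard deviations. This is a minimal sketch of that idea; the window size, z-threshold, and minimum-history cutoff are assumptions you would tune against your own incident data.

```python
from collections import deque
from statistics import mean, stdev

class BaselineAlert:
    """Flag a metric sample as anomalous when it deviates from a rolling
    baseline by more than z_threshold standard deviations."""

    def __init__(self, window=60, z_threshold=3.0):
        self.samples = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value):
        # Require some history before judging, to avoid cold-start noise.
        if len(self.samples) >= 10:
            mu, sigma = mean(self.samples), stdev(self.samples)
            anomalous = sigma > 0 and abs(value - mu) > self.z_threshold * sigma
        else:
            anomalous = False
        self.samples.append(value)
        return anomalous

detector = BaselineAlert(window=60, z_threshold=3.0)
normal = [detector.observe(100 + (i % 5)) for i in range(30)]  # steady traffic
spike = detector.observe(500)                                  # sudden error spike
```

Because the baseline adapts to normal variation, ordinary traffic fluctuations stay quiet while a genuine spike fires, which is exactly the sensitivity/specificity balance described above.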

Building an Effective Monitoring Strategy

Start with user-facing metrics. Monitor error rates, latency percentiles, and throughput for every user-facing endpoint. These metrics directly reflect the user experience and are the most important signals for detecting production issues. If you can only monitor one thing, monitor your users' experience.

Layer in infrastructure and application metrics. Track CPU, memory, disk, and network utilisation alongside application-specific metrics like queue depths, connection pool usage, and cache hit rates. These metrics help diagnose issues and often provide early warning signals before user-facing metrics are affected.

Implement synthetic monitoring that simulates user interactions at regular intervals. Synthetic checks can detect issues even when real user traffic is low, such as during off-peak hours or for rarely used features. They also provide a consistent baseline for comparison, making anomaly detection more reliable.
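A synthetic check boils down to running a user-like probe on a schedule and classifying the outcome. The sketch below assumes `probe` is any callable that performs the interaction (for example, an HTTP GET against a canary endpoint) and raises on failure; the latency threshold and the stub probes are illustrative, not a real service.

```python
import time

def synthetic_check(probe, max_latency_s=2.0):
    """Run one synthetic probe and classify the result as ok / degraded / down."""
    start = time.monotonic()
    try:
        probe()
        latency = time.monotonic() - start
        if latency > max_latency_s:
            return ("degraded", latency)  # up, but slower than users tolerate
        return ("ok", latency)
    except Exception:
        return ("down", time.monotonic() - start)

# Stub probes standing in for real requests to a canary endpoint.
def healthy_probe():
    pass  # a real probe would issue the request and validate the response

def broken_probe():
    raise RuntimeError("connection refused")

status_ok, latency_ok = synthetic_check(healthy_probe)
status_down, _ = synthetic_check(broken_probe)
```

Scheduling this check every minute and alerting on consecutive non-`ok` results gives a detection signal that works even when real traffic is too sparse to move your metrics.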

Key Takeaways

  • MTTD measures the time between when a production issue begins and when your team becomes aware of it
  • Target MTTD of under five minutes for critical issues; every minute of delayed detection extends user impact
  • High-performing teams detect over ninety percent of issues proactively through monitoring, not user reports
  • Invest in the four pillars of observability: metrics, logs, traces, and events
  • Use anomaly detection, synthetic monitoring, and well-tuned alerts to minimise detection time

Frequently Asked Questions

How does MTTD differ from MTTR?
MTTD measures time to detection (knowing an issue exists), whilst MTTR measures time to recovery (restoring normal service). MTTD is the first phase of incident response. Improving MTTD reduces the total MTTR because the faster you detect an issue, the sooner you can begin resolving it. Both metrics should be tracked and improved together.
What is the biggest barrier to improving MTTD?
The most common barrier is incomplete monitoring coverage. Many teams have excellent monitoring for their core services but gaps in supporting systems, third-party integrations, and edge cases. Conduct a monitoring audit to identify blind spots and prioritise coverage for areas that have generated undetected incidents in the past.
Should we alert on everything?
No. Alert only on conditions that require human intervention. Excessive alerting leads to alert fatigue, where engineers begin ignoring or silencing alerts, which ironically increases MTTD. Focus on high-signal alerts and use dashboards for informational monitoring that does not require immediate action.

Get Incident Response Templates

Our Engineering Manager Templates include monitoring checklists, alerting strategy guides, and incident response playbooks to help your team detect and respond to issues faster.

Learn More