On-call load measures the operational burden placed on engineers through incident response, alert handling, and production support duties. Managing this load effectively is crucial for preventing burnout, maintaining team morale, and ensuring that operational work does not crowd out feature development.
What Is On-Call Load?
On-call load encompasses the total volume of operational work generated by production systems, including pages, alerts, incident investigations, and follow-up remediation tasks. It is typically measured as the number of pages or alerts per on-call shift, the percentage of time spent on incident response, or the number of incidents requiring human intervention per week.
On-call load is distinct from on-call scheduling. A team might have a well-structured rotation but an unsustainable alert volume. Conversely, a team with few alerts but a poorly designed rotation can still experience burnout. Both dimensions need attention, but on-call load focuses specifically on the volume and intensity of operational work.
High on-call load has cascading effects beyond the engineers directly involved. When on-call engineers are frequently interrupted, they cannot focus on planned work, which shifts their sprint commitments to other team members. This creates an uneven distribution of feature work and can breed resentment between those who handle operational burden and those who do not.
How to Measure On-Call Load
Track the number of pages and alerts per on-call shift, distinguishing between actionable alerts that require human intervention and informational alerts that could be automated. Your incident management platform (PagerDuty, Opsgenie, or similar) provides this data automatically. Review it weekly to spot trends and identify problematic services.
Measure the percentage of engineering time consumed by operational work across the team. Include not just incident response time but also post-incident remediation, follow-up tasks, and the context-switching cost of being on call. A comprehensive view reveals the true cost of operational burden on your team's delivery capacity.
- Track pages per on-call shift, broken down by severity and service
- Measure the percentage of sprint capacity consumed by operational work
- Distinguish between actionable alerts and noise that should be automated or suppressed
- Record out-of-hours pages separately, as they have a disproportionate impact on well-being
- Track time-to-acknowledge and time-to-resolve for each incident to measure operational efficiency
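As a rough illustration of the measurements listed above, the following Python sketch summarises one shift's pages by service and severity and computes mean time-to-acknowledge and time-to-resolve. The `Page` record and its field names are hypothetical and do not correspond to any vendor's export format; in practice you would map your platform's export onto something similar.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Page:
    """One page. Field names are illustrative, not a vendor schema."""
    service: str
    severity: str
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def summarise_shift(pages):
    """Summarise the pages received during a single on-call shift."""
    if not pages:
        return {"total": 0, "by_service_severity": {},
                "mean_tta": None, "mean_ttr": None}
    # Count pages per (service, severity) pair.
    by_key = Counter((p.service, p.severity) for p in pages)
    # Both durations are measured from the moment the page was triggered.
    tta = mean((p.acknowledged_at - p.triggered_at).total_seconds() for p in pages)
    ttr = mean((p.resolved_at - p.triggered_at).total_seconds() for p in pages)
    return {
        "total": len(pages),
        "by_service_severity": dict(by_key),
        "mean_tta": timedelta(seconds=tta),
        "mean_ttr": timedelta(seconds=ttr),
    }
```

Means are used here for simplicity; a team with a few very long incidents may prefer medians or percentiles so that outliers do not dominate the weekly review.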
On-Call Load Benchmarks
Google's Site Reliability Engineering book recommends that on-call engineers receive no more than two pages per twelve-hour shift on average. If the rate exceeds this, the team should invest in reliability improvements before adding new features. This benchmark is widely cited across the industry as a reasonable target.
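Because shift lengths vary between teams, it can help to normalise the page rate to a twelve-hour window before comparing it with the two-page target. The sketch below is one way to do that; the function names are hypothetical, and averaging over total hours rather than per shift is an assumption made for simplicity.

```python
def pages_per_twelve_hours(page_count, shift_hours):
    """Scale a page count to an equivalent twelve-hour rate."""
    if shift_hours <= 0:
        raise ValueError("shift_hours must be positive")
    return page_count * 12 / shift_hours


def exceeds_page_target(shifts, target=2.0):
    """Check whether the average rate across shifts exceeds the target.

    `shifts` is a list of (page_count, shift_hours) tuples. The rate is
    averaged over total hours, so long shifts weigh more than short ones.
    """
    total_pages = sum(count for count, _ in shifts)
    total_hours = sum(hours for _, hours in shifts)
    return pages_per_twelve_hours(total_pages, total_hours) > target
```

For example, `exceeds_page_target([(1, 12), (5, 12)])` evaluates six pages over twenty-four hours, which is three pages per twelve hours and therefore above the target.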
In terms of capacity allocation, aim for operational work to consume no more than twenty to twenty-five percent of your team's total engineering time. If ops work exceeds thirty percent, your team is in reactive mode and feature delivery will suffer significantly. Some organisations use the term toil budget to describe this allocation and track it explicitly.
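A minimal sketch of how these thresholds might be checked each sprint is shown below. The twenty-five and thirty percent cut-offs come from the guidance above; the function names and the label for the band in between are assumptions for illustration.

```python
def ops_share(ops_hours, total_hours):
    """Fraction of engineering time spent on operational work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return ops_hours / total_hours


def classify_ops_share(share):
    """Map an ops share onto the bands described in the text."""
    if share > 0.30:
        return "reactive"       # feature delivery is likely to suffer
    if share > 0.25:
        return "over target"    # assumed label for the 25-30% band
    return "within target"
```

A team that logs 56 operational hours out of 200 total has a share of 0.28, which this sketch would label "over target".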
Out-of-hours pages deserve special attention. Even a small number of nighttime pages has a disproportionate impact on engineer well-being and retention. Track out-of-hours pages separately and set aggressive targets to minimise them. Consider follow-the-sun rotations for globally distributed teams to eliminate nighttime pages entirely.
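To track out-of-hours pages separately, each page first needs to be classified against the on-call engineer's working hours. The sketch below assumes timestamps are already expressed in the engineer's local time and that business hours are 09:00 to 18:00, Monday to Friday; both are assumptions to adjust for your team.

```python
from datetime import datetime, time

# Assumed business hours; adjust to your team's working pattern.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def is_out_of_hours(local_time):
    """Return True if a page arrived outside business hours.

    `local_time` must already be in the on-call engineer's local time.
    """
    if local_time.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (BUSINESS_START <= local_time.time() < BUSINESS_END)


def split_pages(page_times):
    """Return (in_hours_count, out_of_hours_count) for a list of page times."""
    out_of_hours = sum(1 for t in page_times if is_out_of_hours(t))
    return len(page_times) - out_of_hours, out_of_hours
```

Public holidays and individual working patterns are not handled here; a fuller version would take the engineer's calendar into account.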
Strategies for Reducing On-Call Load
Start by eliminating noisy alerts. Audit every alert in your system and categorise it as actionable, informational, or noise. Suppress or auto-remediate alerts that do not require human intervention. Many teams find that fifty percent or more of their alerts are noise that can be eliminated without any impact on reliability.
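Once each alert has been assigned one of the three categories, the audit result can be summarised as simple proportions. The following sketch assumes the audit is recorded as a mapping from alert name to category; the alert names in the usage note are made up for illustration.

```python
from collections import Counter

CATEGORIES = ("actionable", "informational", "noise")


def audit_summary(audit):
    """Return the share of alerts in each category.

    `audit` maps an alert name to one of CATEGORIES.
    """
    counts = Counter(audit.values())
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}
```

For instance, an audit of four alerts with one actionable, one informational, and two noise would report a noise share of 0.5, which marks those two alerts as the first candidates for suppression or auto-remediation.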
Invest in reliability improvements for the services that generate the most pages. Use your incident data to identify the top three to five sources of on-call load and dedicate engineering effort to addressing their root causes. Common improvements include adding retry logic, improving error handling, increasing capacity margins, and fixing data consistency issues.
- Audit and eliminate noisy alerts that do not require human intervention
- Identify and fix the top sources of pages using incident data analysis
- Implement auto-remediation for common, well-understood failure modes
- Improve monitoring to catch issues before they become pages
- Distribute on-call load fairly across the team with well-designed rotations
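Identifying the top sources of pages is mostly a counting exercise. A minimal sketch, assuming you can export the name of the originating service for each page (the function name and input shape are hypothetical):

```python
from collections import Counter


def top_page_sources(page_services, n=5):
    """Return the n services generating the most pages.

    `page_services` is a list with one service name per page. Each result
    is a (service, page_count, share_of_all_pages) tuple.
    """
    counts = Counter(page_services)
    total = sum(counts.values())
    return [(svc, cnt, cnt / total) for svc, cnt in counts.most_common(n)]
```

The share column makes it easier to judge whether fixing the top few services would meaningfully reduce overall load or whether pages are spread thinly across many services.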
Building a Sustainable On-Call Culture
Ensure on-call duties are distributed fairly and that compensation reflects the burden. Engineers who handle on-call should receive appropriate compensation, time off after particularly heavy shifts, and recognition for their contribution to system reliability. An unfair on-call distribution is one of the fastest paths to team attrition.
Run blameless post-mortems after significant incidents and track follow-up actions to completion. Post-mortems that generate action items but never result in improvements erode trust in the process. Assign owners and deadlines to every remediation task and review completion rates regularly.
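Reviewing completion rates can be as simple as the sketch below, which reports the fraction of post-mortem action items that are done and lists the ones that are past their deadline. The `ActionItem` structure is hypothetical; most teams would read this from their issue tracker instead.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """A post-mortem follow-up task (illustrative fields only)."""
    owner: str
    due: date
    completed_on: Optional[date] = None


def completion_report(items, today):
    """Return (completion_rate, overdue_items) for a list of action items.

    The rate is None when there are no items to report on.
    """
    if not items:
        return None, []
    done = sum(1 for item in items if item.completed_on is not None)
    overdue = [item for item in items
               if item.completed_on is None and item.due < today]
    return done / len(items), overdue
```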
Use on-call retrospectives to continuously improve the on-call experience. After each rotation, ask the on-call engineer about alert quality, documentation gaps, runbook accuracy, and any tools or processes that could be improved. These retrospectives surface specific, actionable improvements that gradually reduce on-call burden over time.
Key Takeaways
- On-call load measures the volume of operational work including pages, alerts, and incident response duties
- Target no more than two pages per twelve-hour shift and keep operational work below twenty-five percent of team capacity
- Audit alerts aggressively: many teams find that fifty percent or more of their alerts are noise that can be eliminated
- Invest in reliability improvements for the top sources of pages for the greatest impact
- Distribute on-call fairly, compensate appropriately, and run retrospectives after each rotation
Frequently Asked Questions
- How do we handle on-call for small teams?
- Small teams face particular challenges because the on-call rotation is more frequent for each individual. Consider sharing on-call responsibilities across related teams, investing heavily in reliability to minimise pages, and implementing auto-remediation for common issues. Some small teams use a secondary on-call tier from a broader engineering pool for escalations.
- Should on-call engineers also work on feature development?
- This depends on your on-call load. If pages are infrequent, combining on-call with feature work is feasible. If pages are frequent, dedicate on-call shifts to operational work, bug fixes, and reliability improvements. As a reference point, Google's SRE model caps operational work, including on-call, at fifty percent of an SRE's time overall, so that the remainder is available for engineering projects.
- How do we reduce out-of-hours pages?
- Start by analysing which alerts trigger outside business hours and whether they truly require immediate human response. Many can be deferred to the next business day, auto-remediated, or prevented through better capacity planning. For globally distributed teams, follow-the-sun rotations can eliminate out-of-hours pages entirely by routing alerts to engineers in appropriate time zones.