On-Call Load: How to Measure & Reduce Alert Volume

Excessive on-call load does more than hurt morale: it directly reduces your team's capacity to build new things. Every unplanned page pulls an engineer out of focused work, and the time needed to regain that focus can exceed the time spent on the incident itself. Measuring on-call load exposes how much of your sprint capacity is silently consumed by operational firefighting.

What Is On-Call Load?

On-call load encompasses the total volume of operational work generated by production systems, including pages, alerts, incident investigations, and follow-up remediation tasks. It is typically measured as the number of pages or alerts per on-call shift, the percentage of time spent on incident response, or the number of incidents requiring human intervention per week.

On-call load is distinct from on-call scheduling. A team might have a well-structured rotation but an unsustainable alert volume. Conversely, a team with few alerts but a poorly designed rotation can still experience burnout. Both dimensions need attention, but on-call load focuses specifically on the volume and intensity of operational work.

High on-call load has cascading effects beyond the engineers directly involved. When on-call engineers are frequently interrupted, they cannot focus on planned work, which shifts their sprint commitments to other team members. This creates an uneven distribution of feature work and can breed resentment between those who handle operational burden and those who do not.

How to Measure On-Call Load

Track the number of pages and alerts per on-call shift, distinguishing between actionable alerts that require human intervention and informational alerts that could be handled automatically. Your incident management platform (PagerDuty, Opsgenie, or similar) typically records this data automatically. Review it weekly to spot trends and identify problematic services.

Measure the percentage of engineering time consumed by operational work across the team. Include not just incident response time but also post-incident remediation, follow-up tasks, and the context-switching cost of being on call. A comprehensive view reveals the true cost of operational burden on your team's delivery capacity.

Track pages per on-call shift, broken down by severity and service
Measure the percentage of sprint capacity consumed by operational work
Distinguish between actionable alerts and noise that should be automated or suppressed
Record out-of-hours pages separately as they have a disproportionate impact on well-being
Track time-to-acknowledge and time-to-resolve for each incident to measure operational efficiency
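
The metrics listed above can be computed from a plain export of page records. The following is a minimal sketch, not a definitive implementation: the `Page` fields, the `summarise_shift` helper, and the 09:00 to 18:00 weekday definition of business hours are all assumptions made for illustration, and timestamps are assumed to be in the on-call engineer's local time.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Page:
    """One page from a hypothetical incident-platform export."""
    service: str
    severity: str
    triggered_at: datetime      # assumed: engineer's local time
    acknowledged_at: datetime
    resolved_at: datetime
    actionable: bool            # True if a human actually had to act


def is_out_of_hours(ts: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """True on weekends or outside the assumed business hours."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def summarise_shift(pages: list[Page]) -> dict:
    """Summarise one on-call shift; ratios and means are None when there were no pages."""
    if not pages:
        return {"pages": 0, "by_service": Counter(), "actionable_ratio": None,
                "out_of_hours": 0, "mean_minutes_to_ack": None,
                "mean_minutes_to_resolve": None}
    return {
        "pages": len(pages),
        "by_service": Counter((p.service, p.severity) for p in pages),
        "actionable_ratio": sum(p.actionable for p in pages) / len(pages),
        "out_of_hours": sum(is_out_of_hours(p.triggered_at) for p in pages),
        "mean_minutes_to_ack": mean(
            minutes_between(p.triggered_at, p.acknowledged_at) for p in pages),
        "mean_minutes_to_resolve": mean(
            minutes_between(p.triggered_at, p.resolved_at) for p in pages),
    }
```

Running `summarise_shift` once per shift and storing the results gives the weekly trend view described above. A low `actionable_ratio` points at alert noise, while a high `out_of_hours` count points at rotation or alert-routing problems.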

On-Call Load Benchmarks

Google's Site Reliability Engineering book recommends a maximum of two incidents per twelve-hour on-call shift, which in practice means an engineer should receive no more than about two pages per shift on average. If the rate exceeds this, the team should invest in reliability improvements before adding new features. This benchmark is widely cited as a reasonable target.

In terms of capacity allocation, aim for operational work to consume no more than twenty to twenty-five percent of your team's total engineering time. If ops work exceeds thirty percent, your team is in reactive mode and feature delivery will suffer significantly. Some organisations use the term toil budget to describe this allocation and track it explicitly.
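
As a rough illustration, the benchmarks in this section can be turned into a simple check. This is a sketch under stated assumptions: the function names `ops_share` and `classify_load` are hypothetical, and the thresholds are the ones quoted in this article (two pages per twelve-hour shift, a twenty-five percent target, and thirty percent as the point where a team is in reactive mode).

```python
def ops_share(ops_hours: float, total_hours: float) -> float:
    """Fraction of total engineering time spent on operational work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return ops_hours / total_hours


def classify_load(pages_per_shift: float, share: float) -> list[str]:
    """Return a list of benchmark breaches; an empty list means within targets."""
    findings = []
    if pages_per_shift > 2:
        findings.append("page rate above two per twelve-hour shift")
    if share > 0.30:
        findings.append("ops share above 30%: team is in reactive mode")
    elif share > 0.25:
        findings.append("ops share above the 25% target")
    return findings


# Example: five engineers, two-week sprint, 40-hour weeks = 400 hours of capacity.
share = ops_share(ops_hours=120, total_hours=400)
print(classify_load(pages_per_shift=1.5, share=share))
```

In the example, 120 of 400 hours is exactly thirty percent, so the check reports that the team is over the twenty-five percent target but has not yet crossed into reactive mode.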

Out-of-hours pages deserve special attention. Even a small number of nighttime pages has a disproportionate impact on engineer well-being and retention. Track out-of-hours pages separately and set aggressive targets to minimise them. Consider follow-the-sun rotations for globally distributed teams to eliminate nighttime pages entirely.
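
A follow-the-sun rotation boils down to routing each alert to whichever region is currently in its working day. The sketch below assumes three hypothetical regions with fixed UTC offsets chosen so that their 09:00 to 17:00 local shifts tile the full 24 hours. It deliberately ignores daylight saving time, weekends, and public holidays, which a real schedule would need to handle.

```python
from datetime import datetime, time, timedelta, timezone

# Hypothetical regions with fixed UTC offsets (no daylight saving handling).
REGIONS = {
    "apac": timezone(timedelta(hours=9)),    # covers 00:00-08:00 UTC
    "emea": timezone(timedelta(hours=1)),    # covers 08:00-16:00 UTC
    "amer": timezone(timedelta(hours=-7)),   # covers 16:00-24:00 UTC
}
SHIFT_START = time(9, 0)   # assumed local shift start
SHIFT_END = time(17, 0)    # assumed local shift end (exclusive)


def region_on_duty(alert_utc: datetime) -> str:
    """Return the region whose local time falls inside its daytime shift."""
    if alert_utc.tzinfo is None:
        raise ValueError("alert_utc must be timezone-aware")
    for name, tz in REGIONS.items():
        local_time = alert_utc.astimezone(tz).time()
        if SHIFT_START <= local_time < SHIFT_END:
            return name
    raise LookupError("no region covers this time; check the shift windows")


alert = datetime(2024, 1, 8, 3, 0, tzinfo=timezone.utc)
print(region_on_duty(alert))
```

With these offsets, an alert at 03:00 UTC lands at midday in the `apac` region, so nobody is paged at night. If the regions' shifts do not cover all 24 hours, the `LookupError` branch makes the gap explicit rather than silently dropping the alert.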

Strategies for Reducing On-Call Load

Start by eliminating noisy alerts. Audit every alert in your system and categorise it as actionable, informational, or noise. Suppress or auto-remediate alerts that do not require human intervention. Many teams find that fifty percent or more of their alerts are noise that can be eliminated without any impact on reliability.

Invest in reliability improvements for the services that generate the most pages. Use your incident data to identify the top three to five sources of on-call load and dedicate engineering effort to addressing their root causes. Common improvements include adding retry logic, improving error handling, increasing capacity margins, and fixing data consistency issues.
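
Finding the top sources of on-call load is a simple counting exercise once you have one record per page. The sketch below assumes the input is just a list of service names, one entry per page; the helper name `top_page_sources` and the sample service names are hypothetical.

```python
from collections import Counter


def top_page_sources(services: list[str], n: int = 5) -> list[tuple[str, int, float]]:
    """Return (service, page_count, share_of_all_pages) for the n noisiest services."""
    counts = Counter(services)
    total = sum(counts.values())
    return [(svc, count, count / total) for svc, count in counts.most_common(n)]


# One entry per page received during the review period (sample data).
pages = ["checkout"] * 5 + ["search"] * 3 + ["auth"] * 2
for service, count, share in top_page_sources(pages, n=2):
    print(f"{service}: {count} pages ({share:.0%} of total)")
```

If a handful of services account for most of the pages, that is where reliability work is likely to pay off first.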

Audit and eliminate noisy alerts that do not require human intervention
Identify and fix the top sources of pages using incident data analysis
Implement auto-remediation for common, well-understood failure modes
Improve monitoring to catch issues before they become pages
Distribute on-call load fairly across the team with well-designed rotations
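
Auto-remediation for well-understood failure modes can be sketched as a lookup from alert name to a remediation function, with a human page as the fallback. Everything here is an assumption for illustration: the alert is a plain dict with a `name` key, the `handle_alert` helper is hypothetical, and the remediation and paging functions stand in for whatever your tooling provides.

```python
from typing import Callable, Dict

Remediation = Callable[[dict], bool]   # returns True if the fix worked


def handle_alert(alert: dict,
                 remediations: Dict[str, Remediation],
                 page_human: Callable[[dict], None]) -> str:
    """Try a known remediation first; page a human if none exists or it fails."""
    fix = remediations.get(alert["name"])
    if fix is None:
        page_human(alert)
        return "paged"
    try:
        if fix(alert):
            return "auto-remediated"
    except Exception:
        pass  # a crashing remediation is treated the same as a failed one
    page_human(alert)
    return "paged-after-failed-remediation"


# Usage with stand-in functions (placeholders, not real integrations).
paged: list = []
remediations = {"disk_full": lambda alert: True}   # e.g. rotate logs
print(handle_alert({"name": "disk_full"}, remediations, paged.append))
print(handle_alert({"name": "unknown_error"}, remediations, paged.append))
```

The key design choice is that a failed or missing remediation always escalates to a person, so automation reduces pages without hiding real problems.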

Building a Sustainable On-Call Culture

Ensure on-call duties are distributed fairly and that compensation reflects the burden. Engineers who handle on-call should receive appropriate compensation, time off after particularly heavy shifts, and recognition for their contribution to system reliability. An unfair on-call distribution is one of the fastest paths to team attrition.

Run blameless post-mortems after significant incidents and track follow-up actions to completion. Post-mortems that generate action items but never result in improvements erode trust in the process. Assign owners and deadlines to every remediation task and review completion rates regularly.

Use on-call retrospectives to continuously improve the on-call experience. After each rotation, ask the on-call engineer about alert quality, documentation gaps, runbook accuracy, and any tools or processes that could be improved. These retrospectives surface specific, actionable improvements that gradually reduce on-call burden over time.

Key Takeaways

On-call load measures the volume of operational work including pages, alerts, and incident response duties
Target no more than two pages per twelve-hour shift and keep operational work below twenty-five percent of team capacity
Audit alerts aggressively: many teams find that fifty percent or more of their alerts are noise rather than actionable signals
Invest in reliability improvements for the top sources of pages for the greatest impact
Distribute on-call fairly, compensate appropriately, and run retrospectives after each rotation

Frequently Asked Questions

How do we handle on-call for small teams?

Small teams face particular challenges because the on-call rotation is more frequent for each individual. Consider sharing on-call responsibilities across related teams, investing heavily in reliability to minimise pages, and implementing auto-remediation for common issues. Some small teams use a secondary on-call tier from a broader engineering pool for escalations.

Should on-call engineers also work on feature development?

This depends on your on-call load. If pages are infrequent, combining on-call with feature work is feasible. If pages are frequent, dedicate on-call shifts to operational work, bug fixes, and reliability improvements. Google's SRE model recommends that operational work, including on-call, take up at most fifty percent of an engineer's time.

How do we reduce out-of-hours pages?

Start by analysing which alerts trigger outside business hours and whether they truly require immediate human response. Many can be deferred to the next business day, auto-remediated, or prevented through better capacity planning. For globally distributed teams, follow-the-sun rotations can eliminate out-of-hours pages entirely by routing alerts to engineers in appropriate time zones.

Build a Sustainable On-Call Rotation

Our templates include rotation planners, alert-triage checklists, and toil-tracking dashboards that protect your team from on-call burnout.

