On-call rotations are a necessary part of running production systems, but poorly designed rotations burn out engineers, create resentment, and ultimately harm both people and system reliability. This guide covers how to design sustainable on-call rotations that distribute the burden fairly, compensate engineers appropriately, and continuously improve through feedback and automation.
Designing Fair and Sustainable Rotations
A well-designed on-call rotation starts with the right team size. A sustainable rotation needs at least four or five engineers - with fewer, each person is on call so often that the burden becomes unsustainable. Each engineer should be on call no more than one week in four, and one week in six is a better target for long-term sustainability.
Distribute on-call responsibility equitably across the team, including senior engineers and tech leads. Exempting senior people from on-call creates resentment and removes them from the operational reality of the systems they design. Their role in the rotation can differ, though - senior engineers may provide escalation support rather than primary coverage.
Define clear on-call responsibilities and expectations. What is the expected response time for a page? What decisions can the on-call engineer make independently, and what requires escalation? What constitutes a valid page versus a false alarm? Document these expectations so that on-call engineers can act confidently at 3 AM without second-guessing themselves.
- Keep at least four or five engineers in the rotation for sustainability
- Limit each engineer to one week on call in every four, and aim for one week in six
- Include senior engineers and tech leads in the rotation to maintain equity and operational awareness
- Document clear expectations for response times, decision authority, and escalation procedures
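The rotation guidance above can be sketched as a simple round-robin schedule. This is only an illustration, not a replacement for a scheduling tool: the engineer names, the start date, and the `build_rotation` helper are all hypothetical, and it assumes a single-tier rotation with one primary engineer per week.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import cycle, islice

MIN_TEAM_SIZE = 4  # assumption taken from the "at least four or five" guidance


def build_rotation(engineers, start, weeks):
    """Assign one primary engineer per week in round-robin order."""
    if len(engineers) < MIN_TEAM_SIZE:
        raise ValueError("rotation is too small to be sustainable")
    schedule = []
    for week_index, engineer in enumerate(islice(cycle(engineers), weeks)):
        week_start = start + timedelta(weeks=week_index)
        schedule.append((week_start, engineer))
    return schedule


# Hypothetical six-person team: each engineer is on call one week in six.
team = ["ana", "bo", "chen", "dee", "eli", "fay"]
schedule = build_rotation(team, date(2024, 1, 1), weeks=12)
shifts_per_engineer = Counter(engineer for _, engineer in schedule)

for week_start, engineer in schedule:
    print(week_start.isoformat(), engineer)
```

With round-robin assignment the on-call frequency is simply one week in N for a team of N, so the team size directly determines whether the one-in-four or one-in-six target is met.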
Compensating and Recognising On-Call Work
On-call work imposes a real cost on engineers - restricted personal time, disrupted sleep, and the stress of being tethered to a pager. Compensate this fairly. Compensation models vary: additional pay per on-call shift, compensating time off after on-call weeks, or a combination of both. Whatever model you choose, ensure it is consistent and transparent.
Recognise on-call contributions visibly. Include on-call burden in performance reviews, acknowledge engineers who handle difficult incidents well, and make on-call work visible to leadership. On-call work is often invisible - it happens outside working hours and is forgotten by the next sprint planning. Ensure it is valued appropriately.
Track on-call burden metrics and share them with leadership. Pages per shift, off-hours pages, incident duration, and the ratio of actionable pages to false alarms provide data that supports investment in reliability improvements and adequate staffing. If the data shows an unsustainable on-call burden, use it to advocate for change.
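As a rough sketch of how these metrics might be computed from a page log, the example below assumes a hypothetical `Page` record and assumes working hours of Monday to Friday, 09:00 to 18:00. The field names and sample data are invented for illustration; adapt them to whatever your paging tool actually exports.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Page:
    fired_at: datetime
    shift_id: str          # hypothetical identifier for the on-call shift
    actionable: bool       # did the engineer need to intervene?
    minutes_to_resolve: float


def is_off_hours(ts, start_hour=9, end_hour=18):
    """Assumed working hours: Monday to Friday, 09:00 to 18:00."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def summarise(pages):
    """Return the burden metrics named in the text for a list of pages."""
    total = len(pages)
    shifts = {p.shift_id for p in pages}
    return {
        "pages_per_shift": total / len(shifts) if shifts else 0.0,
        "off_hours_pages": sum(is_off_hours(p.fired_at) for p in pages),
        "actionable_ratio": sum(p.actionable for p in pages) / total if total else 0.0,
        "mean_minutes_to_resolve": (
            sum(p.minutes_to_resolve for p in pages) / total if total else 0.0
        ),
    }


# Invented sample data covering two weekly shifts.
pages = [
    Page(datetime(2024, 3, 4, 3, 10), "week-10", True, 45),
    Page(datetime(2024, 3, 5, 14, 0), "week-10", False, 5),
    Page(datetime(2024, 3, 9, 11, 0), "week-10", True, 30),
    Page(datetime(2024, 3, 12, 10, 0), "week-11", True, 20),
]
metrics = summarise(pages)
print(metrics)
```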
Improving Alert Quality
The quality of your alerts directly determines the quality of your on-call experience. Noisy alerts - false positives, non-actionable notifications, and alerts that fire for transient conditions - create fatigue and train engineers to ignore pages. Every alert should be actionable: it should tell the on-call engineer what is wrong and what they need to do.
Regularly audit your alert configuration. Review every alert that fired in the past month: Was it actionable? Did the engineer need to intervene? Could the issue have been resolved automatically? Remove or improve alerts that are not meeting these standards. Aim for zero false-positive pages.
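A monthly audit of this kind could start from something as small as the sketch below. It assumes each firing has already been labelled as actionable or not by the engineer who handled it. The alert names, the `audit_alerts` helper, and the 80% threshold are illustrative assumptions, not recommended values.

```python
from collections import defaultdict


def audit_alerts(firings, min_actionable_rate=0.8):
    """Group firings by alert name and flag rules that were rarely actionable.

    `firings` is an iterable of (alert_name, was_actionable) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # name -> [fired, actionable]
    for name, was_actionable in firings:
        counts[name][0] += 1
        counts[name][1] += int(was_actionable)

    report = []
    for name, (fired, actionable) in sorted(counts.items()):
        rate = actionable / fired
        verdict = "keep" if rate >= min_actionable_rate else "fix or remove"
        report.append((name, fired, rate, verdict))
    return report


# Invented firings from the past month.
firings = [
    ("disk_full", True),
    ("disk_full", True),
    ("cpu_spike", False),
    ("cpu_spike", False),
    ("cpu_spike", True),
    ("cert_expiry", True),
]
report = audit_alerts(firings)
for name, fired, rate, verdict in report:
    print(f"{name}: fired {fired}x, {rate:.0%} actionable -> {verdict}")
```

The output is a starting point for the review conversation rather than an automatic decision: a flagged alert may need a better threshold or a longer evaluation window rather than deletion.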
Implement tiered alerting. Not every issue requires waking someone up. Use severity levels that distinguish between immediate pages, warnings that can wait until morning, and informational notifications that do not page at all. Route alerts appropriately based on severity and time of day.
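The routing logic described above might look like the following minimal sketch. The three severity names, the destination labels (`page`, `notify`, `queue_until_morning`, `log_only`), and the 08:00 to 20:00 daytime window are all assumptions made for the example. In practice this logic usually lives in your alerting tool's routing configuration rather than in application code.

```python
from datetime import datetime
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # needs a human immediately
    WARNING = "warning"    # can wait until morning
    INFO = "info"          # never pages


def route(severity, now, day_start=8, day_end=20):
    """Pick a destination based on severity and time of day."""
    if severity is Severity.CRITICAL:
        return "page"
    if severity is Severity.WARNING:
        in_daytime = day_start <= now.hour < day_end
        return "notify" if in_daytime else "queue_until_morning"
    return "log_only"


night = datetime(2024, 3, 4, 3, 0)
morning = datetime(2024, 3, 4, 10, 0)
print(route(Severity.CRITICAL, night))   # page
print(route(Severity.WARNING, night))    # queue_until_morning
print(route(Severity.WARNING, morning))  # notify
print(route(Severity.INFO, night))       # log_only
```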
Preventing On-Call Burnout
On-call burnout is a serious risk that leads to attrition, reduced performance, and ultimately degraded system reliability. Monitor for signs of burnout: resistance to being on call, declining responsiveness during on-call shifts, and engineers expressing frustration about the on-call burden.
After a particularly difficult on-call shift - one with multiple incidents, significant off-hours pages, or extended troubleshooting - provide recovery time. A day or two of reduced workload after a demanding shift allows engineers to recover and signals that the organisation values their wellbeing.
Invest continuously in reducing on-call burden. Every toil item automated, every false alert removed, and every runbook improved makes on-call less painful. Track the trend in pages per shift over time and set a target of continuous reduction. The goal is not to eliminate on-call but to make it manageable and predictable.
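One way to check whether pages per shift are actually trending down is to fit a straight line to the per-shift counts and look at the slope. The sketch below uses an ordinary least-squares slope. The sample counts and the helper names are invented for illustration, and a handful of shifts is too few to draw firm conclusions from.

```python
def pages_per_shift_slope(counts):
    """Least-squares slope of page counts over consecutive shifts.

    A negative slope means pages per shift are falling over time.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two shifts to compute a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def burden_is_falling(counts):
    return pages_per_shift_slope(counts) < 0


# Invented page counts for six consecutive shifts.
recent_shifts = [14, 12, 11, 9, 8, 6]
slope = pages_per_shift_slope(recent_shifts)
print(f"change per shift: {slope:.2f} pages")
print("falling" if burden_is_falling(recent_shifts) else "not falling")
```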
Key Takeaways
- Design rotations with adequate team size and equitable distribution including senior engineers
- Compensate on-call fairly through pay, time off, or both, and recognise contributions in performance reviews
- Audit and improve alert quality continuously - every page should be actionable
- Prevent burnout through recovery time after difficult shifts and ongoing burden reduction
- Track on-call metrics and use data to advocate for staffing and reliability investments
Frequently Asked Questions
- How do I handle engineers who refuse to participate in on-call?
- First, understand their reasons - legitimate concerns about health, family obligations, or unsustainable burden should be addressed. If the concern is about fairness (unequal distribution, poor compensation, or excessive noise), fix those underlying issues. If the refusal is simply about not wanting the responsibility, have an honest conversation about on-call being a fundamental part of owning production systems. In most engineering organisations, on-call participation is an expectation of the role.
- Should on-call engineers continue their regular sprint work?
- Reduce the on-call engineer's sprint commitments during their on-call week. Expecting full sprint velocity plus on-call responsibility is unreasonable and leads to either dropped sprint work (frustrating the team) or delayed incident response (risking reliability). A common approach is to plan the on-call engineer at 50% of their normal sprint capacity during on-call weeks.
- How do I set up on-call for a new service with no incident history?
- Start conservatively with broad alerting and refine based on experience. For the first few weeks, alert on conditions that might indicate problems even if you are not certain they are actionable. After accumulating data on what actually constitutes a real issue versus noise, tighten your alerting to reduce false positives. Document every incident and the response to build runbooks for future on-call engineers.