DevOps and SRE teams operate at the intersection of software development and operations, responsible for the reliability, performance, and scalability of production systems. Managing these teams requires understanding their unique challenges: the tension between reactive operations and proactive improvement, the burden of on-call duties, and the need to influence engineering practices across the organisation.
Defining and Managing Service Level Objectives
Service Level Objectives (SLOs) provide the quantitative foundation for reliability decisions. Work with product and engineering leadership to define SLOs that reflect genuine business needs - not aspirational targets that are impossible to maintain, and not lax targets that allow poor user experience.
Use error budgets to make trade-off decisions. When a service is within its error budget, the team can focus on feature development and improvement. When the budget is depleted, reliability work takes priority. This framework removes emotion from the velocity-versus-reliability debate and replaces it with data.
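As a rough illustration of the error-budget rule described above, the sketch below assumes a request-based SLO measured over a fixed window. The class name `ErrorBudget`, its fields, and the "features" / "reliability" labels are hypothetical and chosen only for this example; a time-based (availability-minutes) SLO would follow the same arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int    # requests served during the window
    failed_requests: int   # requests that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; zero or below means it is spent.
        if self.allowed_failures == 0:
            return 0.0
        return 1 - self.failed_requests / self.allowed_failures

    def priority(self) -> str:
        # Budget left: feature work proceeds. Budget spent: reliability first.
        return "features" if self.remaining_fraction > 0 else "reliability"


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(round(budget.allowed_failures), round(budget.remaining_fraction, 2), budget.priority())
```

With a 99.9% target and one million requests, the budget is about 1,000 failed requests, so 400 failures leave roughly 60% of the budget and feature work continues.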
Review SLOs regularly and adjust them as the business evolves. SLOs that were appropriate for a startup may be too lax for an enterprise product. SLOs that were set during rapid growth may be unrealistic during a period of infrastructure transition.
Identifying and Reducing Toil
Toil is the repetitive, manual, automatable work that scales linearly with service growth. Left unchecked, toil consumes the team's capacity and prevents investment in improvement. Track the percentage of time spent on toil and set targets for reduction.
Prioritise toil reduction by total time cost, which is frequency multiplied by effort. A task that takes 30 minutes and happens daily costs roughly 15 hours a month, so it is a higher priority to automate than a task that takes two hours but happens once a month. Focus automation efforts where they will save the most time.
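The comparison above can be worked through in a few lines. This is a minimal sketch: the `ToilTask` type and the task names are hypothetical, and "daily" is assumed to mean 30 runs per month.

```python
from dataclasses import dataclass


@dataclass
class ToilTask:
    """Hypothetical record of one recurring manual task."""
    name: str
    minutes_per_run: float
    runs_per_month: float

    @property
    def hours_per_month(self) -> float:
        # Monthly cost = effort per run x frequency.
        return self.minutes_per_run * self.runs_per_month / 60


def rank_by_monthly_cost(tasks):
    """Return tasks ordered from most to least time consumed per month."""
    return sorted(tasks, key=lambda t: t.hours_per_month, reverse=True)


tasks = [
    ToilTask("monthly two-hour task", minutes_per_run=120, runs_per_month=1),
    ToilTask("daily 30-minute task", minutes_per_run=30, runs_per_month=30),
]

for task in rank_by_monthly_cost(tasks):
    print(f"{task.name}: {task.hours_per_month:.1f} hours/month")
```

Under these assumptions the daily task costs 15 hours a month against 2 hours for the monthly one, which is why it ranks first for automation.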
Create a culture where toil is reported and addressed, not silently absorbed. Engineers who accept toil as 'just part of the job' prevent it from being recognised as a problem. Make toil visible through tracking, discuss it in retrospectives, and allocate specific capacity for toil reduction.
Building Sustainable On-Call Practices
On-call is one of the biggest sources of burnout in DevOps and SRE teams. Design your on-call rotation to be sustainable: reasonable rotation frequency (no more than one week in four), clear escalation paths, and adequate compensation or time off for on-call duty.
Reduce on-call burden through investment in alerting quality. Noisy alerts - false positives, non-actionable notifications, and alerts that always self-resolve - create alert fatigue and sleep disruption. Regularly audit your alerts and remove or improve ones that are not actionable.
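One way to make an alert audit concrete is to record, for each firing, whether the on-call engineer actually had to act, and then flag alerts that are rarely actionable. The sketch below assumes that record exists as simple `(alert_name, was_actionable)` pairs; the function name, the 50% threshold, and the alert names are illustrative assumptions, not a standard.

```python
from collections import defaultdict


def audit_alerts(firings, min_actionable_ratio=0.5):
    """Return {alert_name: actionable_ratio} for alerts below the threshold.

    firings: iterable of (alert_name, was_actionable) tuples, one per firing.
    """
    totals = defaultdict(int)
    actionable = defaultdict(int)
    for name, was_actionable in firings:
        totals[name] += 1
        if was_actionable:
            actionable[name] += 1

    flagged = {}
    for name, total in totals.items():
        ratio = actionable[name] / total
        if ratio < min_actionable_ratio:
            flagged[name] = ratio
    return flagged


# Hypothetical firing history for two alerts.
history = [
    ("disk_full", True), ("disk_full", True),
    ("cpu_spike", False), ("cpu_spike", False),
    ("cpu_spike", True), ("cpu_spike", False),
]
print(audit_alerts(history))
```

Alerts that appear in the result are candidates for tuning, downgrading to a non-paging notification, or removal.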
After every significant on-call incident, review whether improvements could prevent recurrence. If the same issue pages the on-call engineer repeatedly, it should be prioritised for permanent resolution. Track the number of pages per on-call shift and aim for continuous reduction.
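Tracking pages per shift and spotting repeat offenders needs little more than a page log. The following sketch assumes each page is logged as a `(shift_id, alert_name)` pair; the function names, the threshold of three pages, and the sample data are hypothetical.

```python
from collections import Counter


def pages_per_shift(pages):
    """Count how many pages each on-call shift received."""
    return Counter(shift for shift, _ in pages)


def repeat_offenders(pages, min_count=3):
    """List (alert_name, count) for alerts that paged at least min_count times."""
    counts = Counter(alert for _, alert in pages)
    return [(alert, n) for alert, n in counts.most_common() if n >= min_count]


# Hypothetical page log covering two weekly shifts.
page_log = [
    ("week-1", "db_connections"), ("week-1", "db_connections"),
    ("week-1", "cert_expiry"),
    ("week-2", "db_connections"), ("week-2", "disk_full"),
]
print(pages_per_shift(page_log))
print(repeat_offenders(page_log))
```

Plotting the per-shift counts over time shows whether the burden is falling, and the repeat-offender list gives a starting point for permanent fixes.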
Building a Reliability Culture Across Engineering
Reliability is not solely the SRE team's responsibility - it is a shared concern across all engineering teams. Build a culture where product engineers consider reliability in their design, write production-ready code, and participate in on-call rotations for their own services.
Use SRE engagement models that incentivise product team ownership. If the SRE team automatically takes responsibility for every service, product teams have no incentive to build reliable systems. Consider models where SRE support is earned through meeting reliability standards.
Share incident learnings broadly. Post-mortem reports, incident reviews, and reliability metrics should be visible to the entire engineering organisation. This transparency helps all teams learn from failures and builds a shared understanding of operational challenges.
Balancing Operations with Strategic Improvement
Allocate team capacity explicitly between reactive operations (incidents, support, toil) and proactive improvement (automation, tooling, architecture). A common target is to spend at least 50% of capacity on improvement work. If the team spends more than 50% of its time on reactive operations, either the systems are too unreliable or the team is too small for the load it carries.
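If the team already logs time by category, the 50% check can be automated. This is a sketch under assumptions: the category names, the `REACTIVE` set, and the function names are hypothetical and would need to match however the team actually records its time.

```python
# Categories treated as reactive work in this example (an assumption).
REACTIVE = {"incident", "support", "toil"}


def reactive_share(hours_by_category):
    """Fraction of logged hours spent on reactive work (0.0 if nothing logged)."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    reactive = sum(h for cat, h in hours_by_category.items() if cat in REACTIVE)
    return reactive / total


def exceeds_reactive_cap(hours_by_category, cap=0.5):
    """True when reactive work takes more than the agreed share of capacity."""
    return reactive_share(hours_by_category) > cap


# Hypothetical hours logged by a team in one sprint.
sprint = {"incident": 30, "toil": 30, "automation": 40}
print(reactive_share(sprint), exceeds_reactive_cap(sprint))
```

Here 60 of 100 logged hours are reactive, so the team is over the cap and the split deserves a conversation about reliability investment or headcount.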
Protect improvement time fiercely. Reactive work always feels more urgent than proactive work, and without protection, the team will spend all its time firefighting. Block time for improvement work and treat it as non-negotiable unless a genuine emergency requires all hands.
Invest in observability as a force multiplier. Better monitoring, logging, and tracing reduce mean time to detection and resolution, which frees up time for improvement work. The feedback loop between better observability and faster incident resolution is one of the highest-return investments an SRE team can make.
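To see whether observability investment is paying off, it helps to compute detection and resolution times from incident records. The sketch below assumes each incident has `started`, `detected`, and `resolved` timestamps and measures both intervals from the start of the incident; teams define these metrics differently, so treat the field names and definitions as assumptions.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average interval in minutes between two timestamps across incidents."""
    seconds = [(i[end_key] - i[start_key]).total_seconds() for i in incidents]
    return mean(seconds) / 60


def mean_time_to_detect(incidents):
    return mean_minutes(incidents, "started", "detected")


def mean_time_to_resolve(incidents):
    return mean_minutes(incidents, "started", "resolved")


# Hypothetical incident records.
incidents = [
    {"started": datetime(2024, 1, 1, 10, 0),
     "detected": datetime(2024, 1, 1, 10, 10),
     "resolved": datetime(2024, 1, 1, 11, 0)},
    {"started": datetime(2024, 1, 2, 12, 0),
     "detected": datetime(2024, 1, 2, 12, 20),
     "resolved": datetime(2024, 1, 2, 12, 40)},
]
print(mean_time_to_detect(incidents), mean_time_to_resolve(incidents))
```

Comparing these averages before and after a monitoring or tracing change gives a simple, if coarse, measure of its effect.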
Key Takeaways
- Define SLOs based on genuine business needs and use error budgets to make trade-off decisions
- Track and reduce toil systematically - left unchecked, it grows with the service and consumes improvement capacity
- Design sustainable on-call rotations and invest in alerting quality to reduce burden
- Build reliability culture across engineering - reliability is a shared responsibility, not an SRE-only concern
- Protect at least 50% of team capacity for proactive improvement work
Frequently Asked Questions
- What is the difference between DevOps and SRE?
- DevOps is a set of practices and cultural principles focused on collaboration between development and operations. SRE is a specific implementation of those principles, originating at Google, that applies software engineering practices to operations problems. In practice, many organisations use the terms interchangeably. What matters more than the label is the team's focus: improving reliability, reducing toil, and building systems that scale.
- How do I prevent burnout on my SRE team?
- Monitor on-call burden, toil levels, and work hours closely. If engineers are being paged frequently, spending most of their time on reactive work, or working excessive hours, intervention is needed. Reduce on-call frequency, invest in automation to reduce toil, and push back on taking on additional services without additional headcount. Rotate people between high-stress operational roles and lower-stress project work.
- Should product teams own their own on-call?
- Ideally, yes. Product teams that own their services end-to-end - including on-call - have stronger incentives to build reliable, operable systems. The SRE team can provide tooling, training, and support for product teams' on-call rotations rather than being the sole on-call for all services. This model scales better and distributes the operational knowledge more broadly.