
SLA, SLO, and SLI: A Complete Guide for Engineering Managers

Master SLAs, SLOs, and SLIs for engineering teams. Covers service level definitions, error budgets, reliability targets, and practical implementation for engineering managers.

Last updated: 7 March 2026

Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) form a framework for defining, measuring, and managing service reliability. Together, they provide a shared language between engineering, product, and business teams for making informed trade-offs between reliability and feature velocity. This guide helps engineering managers implement these concepts to drive better reliability decisions.

Understanding SLAs, SLOs, and SLIs

The three concepts form a hierarchy. Service Level Indicators (SLIs) are the metrics you measure - request latency, error rate, availability, throughput. Service Level Objectives (SLOs) are the targets you set for those metrics - 99.9% of requests should complete within 200 milliseconds. Service Level Agreements (SLAs) are the contractual commitments you make to customers, with consequences (typically financial) for failing to meet them.

For engineering managers, SLOs are the most important of the three. SLAs are typically set by business and legal teams, and SLIs are chosen by engineers based on what can be measured. But SLOs sit at the intersection - they require engineering judgement to set realistic targets, product input to understand user expectations, and business context to determine the cost of reliability failures. Getting SLOs right is a cross-functional exercise.

The key insight behind this framework is that 100% reliability is neither achievable nor desirable. Every percentage point of additional reliability comes at exponentially increasing cost. An SLO of 99.9% availability (roughly 8.8 hours of downtime per year) is dramatically cheaper to maintain than 99.99% (roughly 53 minutes per year). The framework forces explicit conversations about how much reliability is enough, based on user needs and business impact.
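The arithmetic behind those figures is straightforward; a minimal sketch in Python, using the availability targets quoted above:

```python
def allowed_downtime_hours(availability: float, window_hours: float = 365 * 24) -> float:
    """Maximum downtime, in hours, permitted by an availability target over a window."""
    return (1 - availability) * window_hours

three_nines = allowed_downtime_hours(0.999)        # 8.76 hours per year
four_nines = allowed_downtime_hours(0.9999) * 60   # 52.56 minutes per year
```

Each additional nine cuts the downtime allowance tenfold, which is why the cost of reliability rises so steeply.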

  • SLIs are the metrics - what you measure (latency, availability, error rate, throughput)
  • SLOs are the targets - what level of reliability you aim for (99.9% availability)
  • SLAs are the contracts - what you promise customers, with consequences for breaches
  • SLOs should always be stricter than SLAs to provide a safety buffer
  • Error budgets - the acceptable amount of unreliability - enable data-driven trade-offs between velocity and reliability
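One way to see the hierarchy is as plain data. A hypothetical sketch, with illustrative numbers rather than recommendations:

```python
from dataclasses import dataclass

@dataclass
class SLI:
    name: str      # the metric being measured
    value: float   # current measured value

@dataclass
class SLO:
    sli: SLI
    target: float  # internal reliability target

@dataclass
class SLA:
    slo: SLO
    promised: float  # contractual commitment, looser than the SLO

availability = SLI("availability", value=0.9993)
internal_slo = SLO(availability, target=0.999)
customer_sla = SLA(internal_slo, promised=0.995)

# The SLO sits between the SLA promise and current performance,
# providing the safety buffer described above.
assert customer_sla.promised < internal_slo.target <= availability.value
```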

Choosing the Right SLIs

Good SLIs measure what users actually experience, not what is convenient to measure. Server CPU utilisation might be easy to monitor, but it tells you nothing about user experience unless it directly correlates with latency or errors. The best SLIs are derived from user-facing metrics: request success rate, page load time, API response latency at meaningful percentiles (p50, p95, p99), and transaction completion rate.
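Percentile SLIs are worth computing carefully. A minimal nearest-rank sketch over hypothetical latency samples:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies_ms = [12, 15, 18, 22, 25, 30, 45, 60, 120, 800]  # illustrative request latencies
p50 = percentile(latencies_ms, 50)  # 25 - the median looks healthy
p95 = percentile(latencies_ms, 95)  # 800 - the tail tells a different story
```

This is why p95 and p99 matter as SLIs: averages and medians hide exactly the slow requests users complain about.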

Choose SLIs that are simple, meaningful, and actionable. A team should be able to look at an SLI dashboard and immediately understand whether users are having a good experience. Avoid composite metrics that combine multiple signals into a single number - they obscure the specific dimension of service quality that is degrading. Instead, track a small number (three to five) of independent SLIs that together provide a complete picture of service health.

Measure SLIs from the user's perspective whenever possible. A health check endpoint that returns 200 OK while users experience errors creates a dangerous illusion of reliability. Synthetic monitoring that simulates real user journeys, real user monitoring (RUM) in the browser, and metrics collected at the load balancer or API gateway level all provide more accurate pictures than internal service health checks.
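As a sketch of the difference, here is an availability SLI computed from hypothetical gateway access-log records rather than a health check (the status codes and counts are made up for illustration):

```python
# (status_code, latency_ms) pairs, as might be exported by a load balancer or API gateway.
requests = [(200, 45), (200, 52), (500, 3), (200, 61), (503, 2), (200, 48)]

def availability_sli(records: list[tuple[int, int]]) -> float:
    """Fraction of requests that succeeded from the user's point of view (non-5xx)."""
    good = sum(1 for status, _ in records if status < 500)
    return good / len(records)

availability_sli(requests)  # ≈ 0.667, even while an internal health check returns 200 OK
```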

Setting Meaningful SLOs

Start by analysing your current performance data. If your service has historically maintained 99.95% availability, setting an SLO of 99.99% requires significant investment, while 99.9% provides a comfortable buffer. Set SLOs based on achievable targets that reflect user expectations, not aspirational goals that the team cannot realistically meet. An SLO that is constantly breached loses credibility and becomes meaningless.
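A hedged sketch of that reasoning: given observed historical availability, pick the strictest conventional target the service has already comfortably beaten. The target ladder here is an assumption for illustration, not a standard:

```python
def suggest_slo(observed_availability: float) -> float:
    """Strictest conventional target strictly below observed performance,
    leaving headroom rather than setting an aspirational goal."""
    ladder = [0.99, 0.995, 0.999, 0.9995, 0.9999]
    achievable = [t for t in ladder if t < observed_availability]
    return max(achievable) if achievable else ladder[0]

suggest_slo(0.9995)  # 0.999 - a comfortable buffer below historical performance
```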

Involve product managers and business stakeholders in the SLO-setting process. They bring crucial context about user tolerance and business impact. An internal tool used by ten employees has very different reliability requirements than a payment processing API handling millions of pounds in transactions. SLOs should be calibrated to the actual impact of unreliability on users and the business.

Define SLOs over meaningful time windows - typically 28-day or 30-day rolling windows. Avoid calendar month windows, which create perverse incentives (a team that burns their error budget early in the month has no incentive to maintain reliability for the remainder). Rolling windows provide consistent pressure to maintain reliability at all times, with recent incidents naturally rolling off as they age past the window.
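A minimal sketch of the rolling-window behaviour, using a hypothetical list of per-day downtime minutes:

```python
def rolling_availability(daily_downtime_min: list[float], window_days: int = 30) -> float:
    """Availability over the most recent rolling window of days."""
    window = daily_downtime_min[-window_days:]
    total_min = len(window) * 24 * 60
    return 1 - sum(window) / total_min

# 60 days of history: a 90-minute incident on day 10, nothing since.
history = [0.0] * 60
history[9] = 90.0
rolling_availability(history[:30])  # incident still inside the window, drags availability down
rolling_availability(history)       # incident has rolled off - availability back to 1.0
```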

Error Budgets and Decision-Making

The error budget is the complement of your SLO - if your SLO is 99.9% availability, your error budget is 0.1% unavailability over the measurement window. This translates to approximately 43 minutes of downtime per 30-day window. The error budget is spent by incidents, deployments that cause errors, and maintenance activities. It is replenished as time passes and old incidents fall outside the window.
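The budget arithmetic, as a sketch using the 99.9% example above:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowable downtime, in minutes, implied by the SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

error_budget_minutes(0.999)  # 43.2 minutes per 30-day window

def budget_remaining(slo: float, downtime_min: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once breached)."""
    total = error_budget_minutes(slo, window_days)
    return (total - downtime_min) / total

budget_remaining(0.999, 30.0)  # ≈ 0.31 of the budget left after 30 minutes of downtime
```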

Error budgets transform the reliability versus velocity debate from a subjective argument into a data-driven discussion. When the error budget is healthy, teams can deploy aggressively, run experiments, and accept the risk of occasional failures. When the error budget is nearly exhausted, the team should shift focus to reliability work - fixing the root causes of recent incidents, improving monitoring, and reducing deployment risk.

Establish clear policies for error budget exhaustion. A common approach is: when the error budget is below 50%, the team must include reliability work in their sprint. When it reaches zero, feature development pauses and the team focuses exclusively on reliability until the budget recovers. These policies must have leadership buy-in - they only work if product managers and executives respect the budget constraints.
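Those thresholds can be made mechanical. A hypothetical policy function mirroring the approach just described (the wording and cut-offs are a sketch, not a prescription):

```python
def budget_policy(remaining_fraction: float) -> str:
    """Map remaining error budget (as a fraction) to a team-level action."""
    if remaining_fraction <= 0:
        return "freeze features: reliability work only until the budget recovers"
    if remaining_fraction < 0.5:
        return "include reliability work in the sprint"
    return "ship normally: budget is healthy"
```

Encoding the policy removes ambiguity in the moment: nobody argues about what a 30%-remaining budget means mid-sprint.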

Practical Implementation

Start with one or two critical services rather than trying to implement SLOs across your entire platform at once. Choose services with clear user impact and existing monitoring data. Define two or three SLIs for each service, set initial SLOs based on historical performance, and build dashboards that show current SLI values, SLO compliance, and remaining error budget. Use tools like Datadog, Grafana, or dedicated SLO platforms to automate the tracking.
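A common building block in such tooling is the burn rate: how fast the error budget is being consumed relative to the pace that would exhaust it exactly at the end of the window. The arithmetic, as a sketch:

```python
def burn_rate(bad_fraction: float, slo: float) -> float:
    """Budget consumption rate. 1.0 spends the budget exactly over the window;
    higher values exhaust it proportionally sooner."""
    return bad_fraction / (1 - slo)

# 0.5% of requests failing against a 99.9% SLO burns budget 5x too fast -
# at that rate, a 30-day budget is gone in 6 days.
burn_rate(0.005, 0.999)  # 5.0
```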

Socialise the concepts with your team before implementing them. Engineers who understand why SLOs exist and how error budgets work will engage with the framework productively. Engineers who see SLOs as just another metric imposed by management will treat them as noise. Run a workshop explaining the concepts, walk through examples from your own services, and let the team participate in choosing SLIs and setting initial targets.

Iterate on your SLOs quarterly. Initial targets are educated guesses - you will learn from operating with them whether they are too tight (constantly breached despite good user experience), too loose (never breached despite occasional user complaints), or misaligned with what users actually care about. Treat SLOs as living targets that evolve with your understanding of user needs and system behaviour.

Key Takeaways

  • SLIs measure user experience, SLOs set reliability targets, and SLAs are contractual commitments
  • 100% reliability is neither achievable nor desirable - SLOs make the trade-off explicit
  • Error budgets transform reliability decisions from subjective debates into data-driven discussions
  • Set SLOs based on historical data, user expectations, and business impact - not aspirational goals
  • Start with one or two critical services and iterate on targets quarterly

Frequently Asked Questions

What is the difference between an SLO and an SLA?
An SLO is an internal reliability target that your engineering team uses to guide decisions and prioritisation. An SLA is an external contractual commitment to customers that carries consequences - usually financial penalties - for breach. Your SLOs should always be stricter than your SLAs. If your SLA promises 99.9% availability, your internal SLO might target 99.95%, giving you a buffer to detect and address reliability issues before they breach the contractual commitment.
How do you handle SLO breaches?
Treat an SLO breach as a signal, not a punishment. When an SLO is breached, conduct a review to understand what happened, update runbooks and monitoring as needed, and adjust priorities to include reliability work. If breaches are frequent, either the system needs significant reliability investment or the SLO is set too aggressively. If breaches never occur, the SLO may be too lenient. The goal is to breach occasionally - that means the target is appropriately calibrated.
Should every service have SLOs?
Not necessarily. Start with services that directly impact users or revenue. Internal tools, batch processing jobs, and development infrastructure may not need formal SLOs initially. As your SLO practice matures, extend coverage to additional services. The effort of maintaining SLOs - tracking, reviewing, and acting on them - should be proportional to the service's importance. A service nobody depends on does not need the overhead of formal reliability targets.
How do you get product managers to respect error budgets?
Involve product managers from the start. When they participate in setting SLOs and understanding error budgets, they become partners in the reliability conversation rather than adversaries. Show them the data: when the error budget is healthy, highlight how it enables faster feature delivery. When it is depleted, show the user impact of recent incidents. Frame error budgets as a tool that maximises long-term velocity by preventing the reliability debt that causes costly outages.
