Social Engineering Service Level Objectives
The secret of change is to focus all of your energy, not on fighting the old, but on building the new.
– Not That Socrates
What’s Your Objective?
As software services became more complex and distributed, Service Level Objectives (SLOs) emerged as a powerful tool for improving reliability. I fell in love with SLOs when I first read the Site Reliability Engineering (SRE) book in 2016. Ever since, I have given everyone who would listen a homework assignment to read the SRE Principles chapters ‘Embracing Risk’, ‘Service Level Objectives’ and ‘Eliminating Toil’, and I added my thoughts on the evolution of SRE and DevOps as a Foreword in ‘The Site Reliability Workbook’.
In theory, SLOs offer an elegant solution to the finger-pointing that we often witness between developers and operators by enabling dispassionate, data-driven decisions about how to improve. In practice, I frequently encounter organizations where teams with SRE titles have not adopted SLOs. From observation, there seem to be two significant hurdles once an organization aspires to this SRE practice: first, getting Service Level Indicators (SLIs) that everyone buys into and trusts; second, ensuring that the SLO error budgets actually prompt a change in priorities and behaviors.
Simple, Not Easy
The idea of SLOs is simple: treat the availability and performance of a service as a first-class feature that gets measured and prioritized. SLOs represent an organizational commitment. To have a real impact, the SLO must not be something that the SREs do unilaterally but a social contract with the business and software engineers. That social contract drives reliability more than the most accurate SLO measurement ever could.
Obviously, establishing SLOs is a prerequisite to using them to make decisions about how to allocate resources and prioritize work. SLOs provide a quantitative measure of how well a service is performing, but we need to get everyone to the table before we can have meaningful conversations about how to interpret the metrics or improve the systems. The SLO development lifecycle (SLODLC) is a fantastic community-developed framework for starting the SLO journey. Ideally, SLOs provide a common language for teams to communicate with each other and with stakeholders, and the SLODLC initiates the stakeholder conversation explicitly from the beginning.
What Could Go Wrong?
Getting alignment on actionable SLOs represents a significant behavioral shift for many organizations. The idea sounds great until the enterprise product managers hear that all the developers are going to stop working on features to focus on reliability. ‘Sure, reliability sounds great, but we’ve got scope to deliver on schedule,’ they’ll say. Unfortunately, many organizations have adopted software strategies and incentives that predate delivering software as a service. As a consequence, operations is often an underfunded mandate that doesn’t concern the developers while the business presumes they can expect ‘all the nines.’
Let’s assume that SLOs exist: the service is defined, stakeholders identified, SLIs measured, error budgets established. The SLODLC suggests ‘Enforce Error Budget Policy’ as a next step, which might be a fantastic strategy in more tech-centric organizations, but in my experience this can be a major point of friction. If data really drove improvements, we’d live in a very different world. Data ends up being less important than how people feel and how they are incentivized. Many organizations struggle at this stage because SLO enforcement gets framed as an SRE-centric initiative and consequently might feel imposed on teams that haven’t fully bought into reliability as their concern, probably because reliability hasn’t been organizationally incentivized for them. The social contract to make SLOs impactful has not been established.
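For readers who want the arithmetic behind ‘SLIs measured, error budgets established’ made concrete, here is a minimal sketch in Python. The class name, field names, and numbers are all hypothetical and chosen only for illustration; it assumes a simple request-based SLI (good events divided by total events) and is not taken from the SLODLC or any particular tool.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Hypothetical container for one SLO evaluation window."""
    target: float       # SLO target as a fraction, e.g. 0.99 for 99%
    total_events: int   # all valid requests observed in the window
    good_events: int    # requests that met the SLI criteria

    def sli(self) -> float:
        """Measured SLI: fraction of events that were good."""
        if self.total_events == 0:
            return 1.0  # assumption: no traffic counts as meeting the SLO
        return self.good_events / self.total_events

    def allowed_bad_events(self) -> float:
        """Error budget expressed as a number of bad events."""
        return (1.0 - self.target) * self.total_events

    def budget_remaining(self) -> float:
        """Fraction of the error budget left; negative means overspent."""
        allowed = self.allowed_bad_events()
        bad = self.total_events - self.good_events
        if allowed == 0:
            # A 100% target (or no traffic) leaves no budget to spend.
            return 1.0 if bad == 0 else 0.0
        return 1.0 - bad / allowed


# Illustrative numbers only: a 99% target with 10,000 requests
# allows roughly 100 bad requests; 50 of them have been used.
window = SloWindow(target=0.99, total_events=10_000, good_events=9_950)
print(f"SLI: {window.sli():.4f}")
print(f"Budget remaining: {window.budget_remaining():.0%}")
```

The number that comes out is the easy part. What the organization agrees to do when the remaining budget approaches zero is the social contract, and no calculation settles that.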
What’s In This For Me?
Everyone wants to be the hero in the story, each with a different motivation, understanding, and interpretation of what that means. If we don’t work through the organizational dynamics to align those narratives, enforcing SLOs may start to look suspiciously like the age-old dev versus ops finger-pointing. Aligning can be simple, but that should not be confused with easy. People usually won’t be motivated to change when they don’t see the problem. Even with the problem established, small misalignments in the solution hinder progress. And even when the solution appears agreeable, the implementation details of what, who, how and when become an issue in proportion to how much behavior change is required relative to the incentive for change.
Small and local changes might be possible with no more than a good conversation between teams, but institutionalizing changes across an organization most often stalls if there isn’t a methodical program to align people. In a large enough organization, the services and service teams are frequently starting with different levels of understanding and empowerment. ‘SLO enforcement’ can successfully be reframed as the prompt for the right conversations to happen at the right time with the right people. Most changes end up being small and local, so don’t underestimate the organizational value of small successes to build enthusiastic coalitions. Progress usually comes one team, one service, one SLO at a time.
If you are interested in this and SLOs in general, check out SLOConf, a free virtual event about all things SLO.