What is site reliability engineering (SRE) and how is it different from DevOps?

By Christian Meléndez | Posted Jan 15, 2021 | 8 min. (1584 words)

Site reliability engineering (SRE) is Google’s approach to service management, in which software engineers run production systems using software engineering methods. Google operates at an unusual scale, so it often has to tackle software bugs and errors in unconventional ways. Still, having software engineers do a job that is traditionally done by professionals with a systems administration background can sound impractical. It worked well enough that Google decided to share what it had learned in book form a few years ago.

After the book was released, SRE roles were embraced by modern software teams at companies such as Atlassian.

In today’s post, I’m going to explain what SRE is and why SRE helps maintain software quality in production systems. You might notice that SRE looks like DevOps, so I’ll also explain how these two concepts relate to each other. Finally, I’ll share a few tips on how the hiring process changes with the SRE approach. Let’s get into it.

A software engineering approach to run production systems

Benjamin Treynor is the founder of Google’s SRE, and he explains that “SRE is what happens when you ask a software engineer to design an operations team.” Treynor’s background has always been software engineering, so it wasn’t a surprise that he decided to design an operations team with mostly development skills. Traditionally, operations tasks don’t scale well when they rely too heavily on humans doing things manually. Google’s SRE team ended up automating their tedious manual tasks, even if they were complex.

Boredom with doing the same manual tasks every day wasn’t the only reason for adding automation. A common challenge when operating a production system is keeping it both up and running and as performant as possible. Usually, systems go down because we release new features or because the infrastructure runs out of capacity. Capacity problems can be handled with automation. Releases are more difficult, because something is always going to go wrong.

That’s why SRE accepts failure and will try to fail in a controlled manner. One way SRE achieves this is by using an error budget.

Error budgets

An error budget tells you how much unreliability a system can afford: it is the gap between 100 percent and the minimum level of reliability you require. A system that is reliable 100 percent of the time is expensive, and users might not even notice if the system goes down for one minute. But users will notice when the system is down for an extended period, even if you don’t. Then what should you do?

Well, start by defining what level of reliability is acceptable for the system. This level will become the target you shouldn’t violate, but it should be achievable enough that it doesn’t lower the pace of change or affect innovation. When a deployment fails, you’ll be able to decide when to stop it and when to keep it going—and you’ll be able to do it in a controlled manner.
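
To make the idea concrete, here is a minimal sketch of how an error budget could be computed and used to decide whether to keep releasing. The ErrorBudget class, its field names, and the 99.9 percent target over a 30-day window are assumptions chosen for illustration. They don't come from Google's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical error budget for one service over a fixed window."""

    slo_target: float    # assumed target, e.g. 0.999 means 99.9% availability
    window_minutes: int  # assumed window, e.g. 30 days = 43,200 minutes

    @property
    def allowed_downtime_minutes(self) -> float:
        # The budget is whatever the target leaves over: (1 - target) * window.
        return (1.0 - self.slo_target) * self.window_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # Budget left after the downtime already observed in this window.
        return self.allowed_downtime_minutes - downtime_minutes

    def can_release(self, downtime_minutes: float) -> bool:
        # Keep shipping while budget is left; pause risky changes once it is spent.
        return self.remaining_minutes(downtime_minutes) > 0


budget = ErrorBudget(slo_target=0.999, window_minutes=30 * 24 * 60)
print(round(budget.allowed_downtime_minutes, 1))  # about 43.2 minutes per 30 days
print(budget.can_release(downtime_minutes=20))    # budget left, keep releasing
print(budget.can_release(downtime_minutes=50))    # budget spent, slow down
```

With these assumed numbers, the team can afford roughly 43 minutes of downtime a month. Once a failed deployment has used that up, the sketch says to stop releasing until the window resets.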

For an SRE team, reliability is the most important feature of a system.

Now let’s talk about what the team does to be reliable.

What does a site reliability engineer do?

How do SREs keep to the error budget and have a reliable system? To answer this question, let me talk about the four core SRE principles an engineer will implement on a daily basis.

1. Ensuring an engineering focus

SREs intentionally spend a certain amount of time on reducing human labor, sharing knowledge among teams, and creating a blameless culture. They also keep track of the system’s reliability, so knowing what’s happening inside the system through error reporting software is crucial. When engineers design software to automate routine tasks, the result is a self-healing system. Humans are notified only when their judgment is needed.

2. Bringing the system back online

How the team responds to emergencies is what allows them to stay within the error budget when something goes wrong. Automating the response through software reduces the dependence on people and eases the pain of failure by making recovery fast.

3. Maintaining compliance with change management

Once you remove the human factor from the equation, change management has to be automated. Automated changes leave a trail, which increases confidence that company rules aren’t being ignored. Automation also increases deploy and release velocity by removing the time people would otherwise need to make a decision.

4. Forecasting and provisioning the capacity of the system

SRE teams provide capacity when it’s needed and scale resources back when it isn’t. Making sure capacity is in place before demand arrives is vital to maintaining the system’s availability.

The above list of practices and principles will not be a surprise for an operations team, because it’s what they traditionally do. What’s different about the software engineering approach is that the work is automated and aimed at keeping the system reliable within its availability target. The team will be less bored performing manual tasks, they’ll be interrupted less, and as a result, the team will become more creative in finding solutions to complex problems.
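
As an illustration of the fourth principle, here is a rough sketch of capacity forecasting: fit a straight line to recent daily peak traffic, project it forward, and convert the result into an instance count with some headroom. The function names, the linear-trend assumption, the 30 percent headroom, and the sample numbers are all hypothetical. Real forecasts usually have to account for seasonality and launches as well.

```python
import math


def fit_line(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def instances_needed(peak_rps_history, days_ahead, rps_per_instance, headroom=0.3):
    """Instances required to serve the forecast peak plus a safety margin."""
    slope, intercept = fit_line(peak_rps_history)
    future_day = len(peak_rps_history) - 1 + days_ahead
    forecast_rps = max(0.0, intercept + slope * future_day)
    return math.ceil(forecast_rps * (1 + headroom) / rps_per_instance)


# Hypothetical daily peak requests per second for the last five days.
history = [100, 110, 120, 130, 140]
print(instances_needed(history, days_ahead=5, rps_per_instance=50))
```

The same function covers both directions of the principle: a rising trend asks for more instances ahead of demand, and a falling trend returns a smaller count so unused resources can be released.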

How does site reliability engineering relate to DevOps?

At this point, you’ve probably noticed some similarities between SRE and DevOps. DevOps isn’t a detailed guide on how to run production systems, which is why everyone has their own way of doing DevOps. Some people even say that DevOps is mainly a cultural mindset. SRE looks like DevOps because the two share similar practices and principles. But as Google’s engineers put it, SRE is an implementation of DevOps with a clear set of practices.

Let’s look at that idea a little more closely. The DevOps movement began with the goal of closing gaps between development and operations but without giving any detailed information on how to do it. That’s one reason why DevOps looks different in every company that decides to implement it. In SRE, by contrast, the gap between teams is mostly closed by having both sides use the same set of tools and practices.

At Google, an SRE acts initially as a consultant for a development team, working together with the developers to build the service. Then, developers can hand over managing the system to the SRE when there’s a clear understanding of what to do when something goes wrong. By this time the system will be quite fault-tolerant and self-healing.

So SRE is not just another way of saying DevOps, but it’s also not a competitor trying to get rid of DevOps.

How to hire a site reliability engineer

By now, an SRE team might look like a good fit for your company.

So how do you hire a site reliability engineer? Like Benjamin Treynor, that person should have a solid background in software engineering with experience in Unix system internals and networking.

In an SRE role, the engineer will be in charge of running production systems and automating human labor through software.

An SRE needs to not only understand code but also be good at creating something from scratch. They should have experience with at least one programming language like Go, Python, or Ruby. An SRE who knows these languages will be able to extend tools like Ansible, Chef, Kubernetes, Docker, and Terraform.

A traditional sysadmin is still a good candidate for this role if they are willing to spend significant time figuring out how to reduce toil. (Toil is a term explained in the SRE book: it’s all the repetitive manual work that could be automated but is still being done by a human.) As the system grows, toil grows linearly, demanding more humans for the job. An SRE’s main function will be to reduce toil as much as possible.

An SRE will also have excellent troubleshooting skills to find solutions to issues and prevent them from happening again. In practice, an SRE uses meaningful metrics to trigger automatic remediations and starts troubleshooting only when there’s a new problem or when automation didn’t solve the problem. An SRE might like to lean on an error monitoring tool.
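
The pattern described above, where a metric triggers automatic remediation and a human is involved only if that fails, might look something like the following sketch. The handle_alert function, its callback parameters, and the error-rate numbers are hypothetical. In a real setup the callbacks would wrap your own monitoring and paging tools.

```python
from typing import Callable, List


def handle_alert(
    metric_value: float,
    threshold: float,
    remediate: Callable[[], None],
    read_metric: Callable[[], float],
    page_human: Callable[[str], None],
    max_attempts: int = 2,
) -> str:
    """Try automatic remediation first; page a human only if it does not help."""
    if metric_value <= threshold:
        return "healthy"
    for _ in range(max_attempts):
        remediate()  # e.g. restart a worker or roll back a release
        if read_metric() <= threshold:
            return "auto-remediated"
    page_human(f"metric still above {threshold} after {max_attempts} attempts")
    return "escalated"


# Simulated run: the error rate recovers after the second remediation attempt.
error_rates = iter([0.20, 0.01])
pages: List[str] = []
result = handle_alert(
    metric_value=0.25,
    threshold=0.05,
    remediate=lambda: None,  # stand-in for a real remediation step
    read_metric=lambda: next(error_rates),
    page_human=pages.append,
)
print(result, pages)
```

Passing the remediation, metric, and paging steps in as callbacks keeps the decision logic easy to test without touching production systems.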

Regarding infrastructure, an SRE needs to have experience with infrastructure as code, immutable infrastructure, distributed systems, configuration management, continuous integration, and continuous delivery. In other words, they need the expertise to build a resilient system that is easy to change and evolve over time.

In a nutshell, if you’re looking to build an SRE team, you should be looking for software engineers with infrastructure experience.

SRE isn’t just for Google

There’s a lot more to say about error budgets and SRE practices. I didn’t go into much detail, but hopefully SRE is clearer now. Even though this way of working was born at Google, it’s not just for the giants of the software world. SREs are now employed by modern software teams to ensure their software is reliable for their customers.

As software developers take increased responsibility in SRE, there will be an increased reliance on software monitoring tools like crash reporting, APM, and real user monitoring.

If you want to learn more about SRE, there’s a video series from Google that’s worth watching. The first SRE book is also available for free, and the latest one is an SRE workbook. More and more companies are starting to implement SRE, so its adoption is likely to keep growing.