How Google Runs Production Systems — A Book by Google’s Site Reliability Engineers

Arunachalam Alagappan
Jun 23, 2023 · 10 min read


How Google Runs Production Systems, a book by the site reliability engineers at Google, offers comprehensive insights into what site reliability engineering (SRE) really means, the roles and responsibilities of a site reliability engineer, and the discipline's goals and principles. Google, known for running services such as Gmail, YouTube, Google Maps, and Google Drive at massive scale with great reliability, invented the concept of site reliability engineering. This post is my summary of the book; I hope it makes for an interesting read.

Site Reliability Engineering — An Introduction

Software engineering is a discipline that deals with designing and building software systems. Another discipline is needed to cover the full lifecycle of a software product, from inception through deployment, operation, and refinement, to eventual decommissioning. Google names this discipline site reliability engineering. Benjamin Treynor Sloss, Google's VP for 24/7 Operations, coined the term SRE. According to him, reliability is fundamental: a system is not useful if nobody can actually use it, even if it is technically up all the time.

Key Learnings that can be applied:

-> The error budget provides a clear metric that determines how unreliable the service is allowed to be in a given quarter. Once the dev team has spent the allowed error budget, no new launches or releases are allowed until the service is back within its budget.

-> Every site reliability engineer should spend at least 50% of their time on automation or eliminating toil (manual, repetitive tasks).

-> Avoid software bloat. Ensuring modularity and keeping the APIs/code loosely coupled helps us make changes to the system in isolation and to debug easily by narrowing down to the specific module.

-> For effective troubleshooting with large systems, bifurcate the components of the system into parts, like the frontend layer, backend layer, and DB, and narrow down on which layer is likely causing the issue.

-> Every negative result or antipattern must be documented in the postmortem phase, because negative results are magic to any team: they help everyone avoid repeating the same mistakes in future.

-> When launching a new product or feature, have a launch coordination checklist in place so that the team is prepared for the likely challenges and pitfalls of launching a product to millions of users.

-> Logging is another invaluable tool. It is really useful to have multiple verbosity levels available with information about each operation logged.

-> We should document the history of outages. At Google, the tool that maintains this history is called the Outalator.

-> Impose outages intentionally to see how the systems actually react, uncover unexpected weaknesses, and identify ways to make the system more resilient in such situations.

-> Effective recovery mechanisms, such as taking full backups during off-peak hours and taking incremental backups during business hours, are imperative.

Responsibilities Of SRE

The SRE team is responsible for ensuring the system's availability, reliability, and performance.

Availability

A system should have high uptime; a system is not useful for anyone if it’s not up. Availability is measured in terms of nines — https://en.wikipedia.org/wiki/High_availability#%22Nines%22
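As a rough back-of-the-envelope illustration (my own sketch, not from the book), each additional nine can be translated directly into the downtime it permits:

# Rough sketch (not from the book): converting an availability target
# into the downtime it permits per year and per 30-day month.
MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_PER_MONTH = 30 * 24 * 60

for availability in (0.99, 0.999, 0.9999, 0.99999):
    allowed_fraction = 1 - availability
    print(f"{availability:.3%} -> "
          f"{allowed_fraction * MINUTES_PER_YEAR:,.1f} min/year, "
          f"{allowed_fraction * MINUTES_PER_MONTH:,.1f} min/month")

Four nines (99.99%) works out to roughly 52.6 minutes of allowed downtime per year.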

Reliability

A system should always give the desired output. We should have the services return the correct response instead of just being available.

Performance

A system should perform the same way under heavy load. And for customers, speed matters: if the page load time is not within the expected threshold, customers may move to a competitor's website. Google's recommended page load time is 2 seconds; beyond that, the chances of customers leaving the site increase.

SLI, SLO, And SLA

SLI — Service Level Indicator — is what we are going to measure (the metric). For instance, for an API we might measure the error rate and the throughput: the error rate is the ratio of failed requests to total requests, and the throughput is the number of requests the system handles per second.

SLO — Service Level Objective — is simply an SLI plus a threshold: the values the SLI is allowed to take. For example, an API might be allowed an error rate of less than 1% averaged over some window while sustaining a throughput of 10,000 requests per second.

SLA — Service Level Agreement — is a contract that spells out what happens when the SLO is not met and what consequences follow.
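As a minimal sketch (my own illustration with made-up numbers, not Google's tooling), an SLI such as the error rate can be computed from request counts and checked against an SLO target:

# Minimal sketch (made-up numbers): computing an error-rate SLI and
# checking it against an SLO target.

def error_rate_sli(failed_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that failed."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests

SLO_MAX_ERROR_RATE = 0.01  # SLO: at most 1% of requests may fail

sli = error_rate_sli(failed_requests=1_200, total_requests=250_000)
print(f"error-rate SLI: {sli:.4%}")   # 0.4800%
print("SLO met" if sli <= SLO_MAX_ERROR_RATE else "SLO violated")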

It is important to measure SLOs like a user; for example, measuring error rates and latency at the Gmail client rather than at the server helped increase Google’s availability to 99.99% from 99.0% in a few years.

Why SRE Is Important

SRE teams check the feasibility of new feature launches by setting up SLAs (agreements) based on SLIs (indicators) and SLOs (objectives). With a standard SLO and error budget in place, once the dev team has spent the allowed error budget, no new launches or releases are allowed until the service is back within its budget.
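Here is a hedged sketch of that gating logic (the numbers are hypothetical): the error budget is simply the amount of unreliability the SLO still permits, and once measured failures exceed it, launches are paused.

# Hypothetical sketch of error-budget gating (numbers are made up).
# Error budget = the amount of unreliability the SLO still allows.

SLO_TARGET = 0.999            # 99.9% of requests must succeed this quarter
total_requests = 10_000_000   # requests served so far this quarter
failed_requests = 7_500       # failed requests so far this quarter

budget_requests = (1 - SLO_TARGET) * total_requests   # failures the SLO allows
budget_remaining = budget_requests - failed_requests

print(f"error budget: {budget_requests:,.0f} failed requests allowed")
print(f"remaining:    {budget_remaining:,.0f}")

if budget_remaining <= 0:
    print("Error budget exhausted: freeze launches until reliability recovers.")
else:
    print("Error budget available: launches may proceed.")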

Embracing Risk

SREs should ensure that the system is highly reliable, but beyond a point, maximizing reliability stops being meaningful. If we have a reliability target of 99.99%, we strive to make the system reliable up to that point and not beyond it, because users may still face trouble due to other factors such as a weak internet connection or an inefficient device. Hence, high reliability is prudent, but extreme reliability is neither needed nor desired. Instead of spending time on extreme reliability, SREs can take a calculated risk of unavailability and spend that budget on new features, innovation, automation, and so on.

Eliminating Toil

Any activity that tends to be manual, repetitive, or automatable and that scales linearly as a service grows is toil. It is better to run a script than to execute a set of commands manually. As per Google, every SRE should spend at least 50% of their time on automation or eliminating toil.
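For example (a hypothetical sketch, not an actual Google tool; the service names and the systemctl-based restart are my own assumptions), a repetitive manual check like "is each service healthy, and restart it if not" can be folded into a small script:

# Hypothetical sketch: replacing a repetitive manual health check with a script.
# The service names and the systemctl-based restart are assumptions.
import subprocess

SERVICES = ["frontend", "api", "worker"]

def is_healthy(service: str) -> bool:
    # 'systemctl is-active --quiet' exits with 0 when the unit is running.
    result = subprocess.run(["systemctl", "is-active", "--quiet", service])
    return result.returncode == 0

for service in SERVICES:
    if is_healthy(service):
        print(f"{service}: healthy")
    else:
        print(f"{service}: unhealthy, restarting")
        subprocess.run(["systemctl", "restart", service], check=False)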

DevOps Vs SRE

A DevOps team works towards improving the speed and quality of software delivery, taking care of CI/CD (continuous integration/continuous delivery) and build/release activities. SRE, on the other hand, focuses on ensuring the reliability and performance of systems and services through monitoring, alerting, change management, automation, emergency response, and so on.

Monitoring Vs Observability

Monitoring is about capturing and displaying information about failures through alerts and metrics, whereas observability is about deciphering what is happening from logs, traces, and metrics to help pinpoint the root cause of a failure. Monitoring is just a subset of observability.

Symptoms Vs Causes

A monitoring system should answer two questions:

1. What’s broken

2. Why

What's broken is the symptom, and why is the cause. For instance, "I am serving HTTP 500s" is a symptom, and "database servers are refusing connections" is a cause.

White Box Monitoring Vs Black Box Monitoring

White box monitoring delves into the internals of the system, such as the number of HTTP requests per second the server reports, the number of DB queries running on the database, CPU and disk usage, and so on, whereas black box monitoring measures externally visible behaviour as a user would see it, such as whether the service is up and responding to requests. White box monitoring is generally more valuable for debugging: it is transparent, and the detail it provides helps identify the cause of a failure. Black box monitoring is opaque and symptom-oriented; it only tells you that something is wrong right now.
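A rough sketch of the difference (the endpoints here are hypothetical): a black-box probe checks the service from the outside exactly as a user would, while a white-box check reads the internals the service itself exposes.

# Rough sketch (hypothetical endpoints): black-box probe vs white-box check.
import json
import urllib.request

def black_box_probe(url: str) -> bool:
    """Black box: hit the service from the outside, like a user would."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except OSError:
        return False

def white_box_check(metrics_url: str) -> dict:
    """White box: read internal metrics the service itself exposes,
    e.g. requests per second or in-flight DB queries."""
    with urllib.request.urlopen(metrics_url, timeout=5) as response:
        return json.loads(response.read())

if __name__ == "__main__":
    print("service up:", black_box_probe("https://example.com/healthz"))
    # print(white_box_check("http://localhost:9102/metrics.json"))  # if such an endpoint exists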

Four Golden Signals Of Monitoring

They are Latency, Traffic, Errors, and Saturation.

Latency

It is the time it takes to serve a request. It is pivotal to measure the latency of both successful and failed requests: the latency of successes matters, but a slow error is even more frustrating than a fast error.

Traffic

It is the measure of how much demand is being placed on your system. For a web service or API, it is the number of HTTP requests per second; for a key-value store like Redis or Cassandra, it is the number of transactions or retrievals per second.

Errors

Errors can be explicit or implicit. Explicit errors are responses such as HTTP 500s, whereas implicit errors are responses that look successful but are wrong, like an HTTP 200 with an incorrect JSON body, or responses that violate a policy, such as a service that is meant to respond within 1 second taking longer than that.
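A small sketch of that classification (the latency budget and field names are my own assumptions): a response counts as an error if the status code says so explicitly, or implicitly if the payload is wrong or the promised response time is exceeded.

# Sketch with assumed thresholds/fields: explicit vs implicit errors.

LATENCY_BUDGET_S = 1.0   # assumed policy: responses slower than 1 s count as errors

def is_error(status_code: int, body: dict, latency_s: float) -> bool:
    if status_code >= 500:                            # explicit (e.g. HTTP 500)
        return True
    if status_code == 200 and "result" not in body:   # implicit: 200 with wrong payload
        return True
    if latency_s > LATENCY_BUDGET_S:                  # implicit: correct but too slow
        return True
    return False

print(is_error(500, {}, 0.2))                  # True  (explicit)
print(is_error(200, {"oops": 1}, 0.2))         # True  (implicit: wrong payload)
print(is_error(200, {"result": 42}, 1.7))      # True  (implicit: too slow)
print(is_error(200, {"result": 42}, 0.3))      # False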

Saturation

It is a measure of how "full" the system is, i.e., how much more load it can absorb before it reaches maximum utilization (say, 100%). Can our system handle double the traffic, or only 10% more? Measuring the 99th-percentile response time over a small window, such as one minute, gives a useful early signal of saturation. The 99th percentile captures the most frustrated 1% of users (those with the highest response times): if the 99th-percentile response time is 5 seconds, then roughly 1 request in 100 takes 5 seconds or more.
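As a worked example (purely illustrative numbers), the 99th-percentile latency over a one-minute window can be computed from the recorded request latencies with a nearest-rank percentile:

# Illustrative sketch: 99th-percentile latency over a one-minute window.
import math
import random

random.seed(1)
# Pretend these are the latencies (seconds) of requests seen in the last minute:
# most are fast, but 15 of the 1,000 requests took about 5 seconds.
latencies = [random.uniform(0.05, 0.4) for _ in range(985)] + [5.0] * 15

def percentile(values, pct):
    """Nearest-rank percentile."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

p99 = percentile(latencies, 99)
print(f"p99 latency over the window: {p99:.2f} s")   # 5.00 s with this toy data

In other words, the slowest ~1% of users in this window waited around 5 seconds, which is exactly the frustrated tail the 99th percentile is meant to expose.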

Simplicity

It is pivotal to keep the software product as simple as possible because every line of code added to the project can potentially introduce a bug. Hence, it is prudent to remove the clutter and avoid software bloat. A smaller project is easier to comprehend and test, and it has fewer defects. Writing clear and minimal APIs is instrumental in maintaining software simplicity. Ensuring modularity and keeping the APIs loosely coupled helps us make changes to the system in isolation. If a bug is seen as part of a larger system, it will be easier to narrow it down to the specific module and rectify it without disturbing the other modules.

Effective Troubleshooting

Once we have collected a set of observations, it is pivotal to look at the system's logs and telemetry to narrow down the failure. Extensive logging with multiple verbosity levels is important for debugging. The divide-and-conquer approach is best suited for large systems, where debugging linearly would be too slow. For such systems, bifurcate the components into parts, like the frontend layer, backend layer, and DB, and narrow down which layer is likely causing the issue. It is also important to correlate the system state with recent changes and know what touched it last: was the failure due to a recent configuration change or a recent deployment?
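Here is a hedged sketch of that divide-and-conquer idea (the layer names and check functions are stand-ins for real health probes): check each layer in turn and stop at the first one that looks unhealthy.

# Hypothetical sketch: narrowing a failure down by layer instead of
# reading every log linearly. The checks are stand-ins for real probes.

def check_frontend() -> bool:
    # e.g. probe the load balancer / front-end health endpoint
    return True

def check_backend() -> bool:
    # e.g. call an internal API health endpoint
    return True

def check_database() -> bool:
    # e.g. run a trivial query such as SELECT 1
    return False

LAYERS = [
    ("frontend", check_frontend),
    ("backend", check_backend),
    ("database", check_database),
]

for name, check in LAYERS:
    if not check():
        print(f"first unhealthy layer: {name} -- focus the investigation here")
        break
else:
    print("all layers look healthy; correlate with recent config changes and deployments")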

Scaling Up Vs Scaling Out

Scale Vertically (Scale-Up)

It is about increasing or upgrading the hardware on the machine — a stronger CPU, more RAM, etc.

Scale Horizontally (Scale-Out)

It is about increasing the number of servers, enabling auto-scaling, and spinning up more EC2 instances automatically so that the system can serve more traffic and requests.
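A toy sketch of that scale-out decision (the thresholds are made up, and a real setup would go through the cloud provider's autoscaling API rather than this function):

# Toy sketch (made-up thresholds): choosing an instance count from current load.
# A real deployment would delegate this to the cloud provider's autoscaler.
import math

def desired_instance_count(current_instances: int,
                           avg_cpu_utilization: float,
                           target_utilization: float = 0.6,
                           max_instances: int = 20) -> int:
    """Scale out so average CPU utilization moves back toward the target."""
    if avg_cpu_utilization <= 0:
        return current_instances
    desired = math.ceil(current_instances * avg_cpu_utilization / target_utilization)
    return max(1, min(desired, max_instances))

print(desired_instance_count(current_instances=4, avg_cpu_utilization=0.9))   # 6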

Postmortem Culture

A postmortem is a written record of an incident: its impact, its root causes, and the actions taken to rectify it. A postmortem should always be blameless; it should focus on deciphering the causes of the failure rather than indicting individuals or pointing fingers. A blameless postmortem documents what went wrong and how the services can be improved by learning from past failures. Every negative result or antipattern must be documented, because negative results are magic to any team. For example, a dev team might decide against using a web server that, due to lock contention, fails at 800 connections when 8,000 are required; because another team had documented this previously, other teams could avoid the same pitfall and its performance limits.

Emergency Response

It is about how well we respond when there is an emergency. All problems have solutions; if the person handling the incident can't think of one, other team members must be pulled in, as the highest priority is always to resolve the issue quickly. Crucially, once the emergency is mitigated, the incident and the response should be recorded so that a similar outage can be prevented from recurring. At Google, the tool that maintains the history of outages is called the Outalator.

Disaster Recovery Testing

It is the practice of pushing production systems to their limits and imposing outages intentionally, to see how the systems actually react, uncover unexpected weaknesses, and identify ways to make them more resilient in such situations.

Capacity Planning

Capacity planning means having the required redundancy and capacity to meet future demand. Demand growth falls into two categories: organic and inorganic. Organic growth comes from a natural increase in the number of users of the product, while inorganic growth comes from spikes in usage during specific campaigns or launches, such as Thanksgiving or Black Friday.
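A simple planning sketch (all the growth numbers here are invented): project organic growth forward, then layer an event multiplier and some redundancy headroom on top for inorganic spikes.

# Invented numbers: projecting demand for capacity planning.
current_peak_qps = 12_000          # peak queries per second today
organic_growth_per_month = 0.05    # ~5% natural user growth per month
months_ahead = 6
event_multiplier = 2.5             # expected spike for a launch / Black Friday
redundancy_factor = 1.25           # headroom for failures and maintenance

organic_peak = current_peak_qps * (1 + organic_growth_per_month) ** months_ahead
planned_capacity = organic_peak * event_multiplier * redundancy_factor

print(f"organic peak in {months_ahead} months: {organic_peak:,.0f} qps")
print(f"capacity to provision:               {planned_capacity:,.0f} qps")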

Data Integrity Vs. Data Availability

An SLO of 99.99% uptime permits only about an hour of downtime in a whole year. Consider the scenario of a prolonged data outage at an email provider: the outage lasted ten days, and the provider announced that contacts and emails had been lost. Users were furious, and many abandoned the service. Several days after the mishap, however, the provider announced that the data could in fact be recovered. Still, the data had been unavailable for a prolonged period of ten days, and data integrity without robust data availability is pretty much the same as having no data at all. Effective recovery mechanisms, such as taking full backups during off-peak hours and incremental backups during business hours, are therefore imperative.
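A hedged sketch of that backup policy (the schedule boundaries and the backup command are hypothetical): full backups off-peak, incremental backups during business hours.

# Hypothetical sketch of the backup policy above: full backups off-peak,
# incremental backups during business hours. The backup command is a stand-in.
from datetime import datetime

BUSINESS_HOURS = range(9, 18)   # 09:00-17:59 local time

def choose_backup_type(now: datetime) -> str:
    return "incremental" if now.hour in BUSINESS_HOURS else "full"

def run_backup(now: datetime) -> None:
    backup_type = choose_backup_type(now)
    print(f"{now:%Y-%m-%d %H:%M} -> running {backup_type} backup")
    # e.g. subprocess.run(["backup-tool", "--mode", backup_type]) in a real job

run_backup(datetime(2023, 6, 23, 2, 0))    # off-peak        -> full
run_backup(datetime(2023, 6, 23, 11, 0))   # business hours  -> incremental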

Other Resources:

Google's SRE books and related articles are freely available at https://sre.google/books/.

Site Reliability Engineering has truly introduced a paradigm shift in the way large-scale systems are managed, compared to the traditional software engineering approach, through its principles and heuristics. I would highly recommend the book "How Google Runs Production Systems" to anyone who is curious about SRE and observability and wants to know what SRE really means.
