Site Reliability Engineering: It's a Kind of Magic

SREs possess the skills and tools needed to adopt new tech and become an essential part of future-proofing the business.

Hagrid: You’re a wizard, Harry.

Harry Potter: I’m a what?

Hagrid: A wizard. And a thumping good one at that, I’d wager… once you train up a little.

It’s the best bit from the first Harry Potter movie, and an apt description of how it must feel to be a site reliability engineer. And what’s a site reliability engineer (SRE)? Well, it’s the IT equivalent of a wizard, or as Andrew Widdowson, an SRE at Google, described it, “Like being part of the world’s most intense pit crew… changing the tires of a race car as it’s going 100 mph.”

So how is an SRE any different from traditional IT operations, and can a discipline originating in the world of web-scale, cloud-native “unicorns” ever apply to the steady-as-she-goes state of enterprise IT?

It can, because the notion of enterprise IT as a technology function confined behind closed walls no longer holds. Now, the only way to create and conduct business at scale is through mobile and cloud, meaning the operational focus shifts from “keeping the (technology) lights on” towards engineering reliability at levels never before imagined.

SRE differs from traditional IT operations in its emphasis on engineering. Reliability, like any feature, isn’t something that’s retrofitted after deployment; it’s established and enhanced as software is developed, tested and released. That means establishing a new discipline, which Ben Treynor, Google’s original SRE lead, describes as “what happens when a software engineer is tasked with what used to be called operations.”

A Sobering Reality

It’s easy, of course, to throw out yet another three-letter acronym and claim it’s a magical elixir for all the problems involved with running complex IT systems. In reality, engineering reliability into distributed systems with thousands of containerized applications and microservices is a tough gig. That’s partly because of all the moving parts, but also because any preconceived notions about predictable system behavior no longer apply.

Take, for example, keeping watch over a modern software application. This might consist of business logic written in several languages and linked to the legacy ERP system (custom-built or packaged or both). There will also be a raft of databases: traditional relational for transactional support, yes, but more likely a smorgasbord of NoSQL data stores (be that in-memory, graph or document), perhaps fronted by recently adopted Node.js. Some of this componentry will be on-premises, some will be containerized and moved to the public cloud. That might mean Docker and Kubernetes on AWS, but maybe Azure and Mesos. Heck, why not both for some hybrid-style resilience?

But like the old Monty Python sketch, “you’ll be lucky” if this is all you ever have to manage. Depending on the nature of the business, there might also be a glut of third-party services, including payment processing and reconciliation. That’s not to mention all the new web and mobile apps interacting with the core business systems through an API gateway, and possibly some analytics horsepower delivered by the likes of Hadoop and Elasticsearch.

It’ll take a lot of operational wizardry to keep all that performant.

Fortune Favors the Bold

In a wonderful talk at SREcon earlier this year, Julia Evans from Stripe described the realities of managing today’s complex distributed systems. What was refreshing about her presentation was the open admission that she often finds the work difficult and there’s always a ton of new stuff to learn. As she says in her abstract, and maybe just a tad like Harry Potter, she doesn’t always feel like a wizard.

This honesty illustrates what’s exciting about being an SRE. With systems like those described above causing any number of thorny problems, it’ll be the inquisitive and brave that keep business on track. Being an SRE isn’t for the faint-hearted or those happy with a firefighting status quo. It’s for those within our ranks who get bored easily: the super sleuths who keep asking reliability questions, crafting improvements and learning as they go.

Let’s consider a typical business-critical problem that could impact our aforementioned modern application. For instance, let’s say some latency issue is causing an increasing number of mobile app users to abandon a booking service. How would teams address the issue? Problems like this might go unnoticed for some time, or there could be a deluge of alarms. Even when a problem is identified, where do teams find the root cause? Is it a problem with a new code release or at the API gateway? Is it down to some weird microservice auto-scaling issue, and was that earlier CPU increase we thought was okay actually really bad?

With an SRE-style approach, business-critical problems are never addressed in knee-jerk fashion. Using modern tooling in areas such as application performance management and app analytics, SREs can observe the real-time behavior of applications, with systems collecting and correlating information from all related components. Rather than reacting after the fact, these solutions continuously identify anomalous patterns (like those mobile app abandonments) and compare them to historical trends, meaning SREs are alerted well before the business is impacted.
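
To make the comparison against historical trends concrete, here is a minimal sketch of one very simple way to flag an unusual abandonment rate: measure how far the latest reading sits from the historical average, in standard deviations. The function name, the threshold of 3 and the sample figures are all hypothetical and chosen for illustration; real monitoring tools use far richer models (seasonality, correlation across components), so treat this as a toy version of the idea rather than how any particular product works.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Return True when `current` is more than `threshold` standard
    deviations above the mean of `history` (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough history to estimate a spread
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # History is perfectly flat; any increase counts as unusual.
        return current > baseline
    return (current - baseline) / spread > threshold


# Hypothetical hourly abandonment rates for a booking flow (illustrative only).
history = [0.041, 0.039, 0.043, 0.040, 0.042, 0.038, 0.041]

print(is_anomalous(history, 0.044))  # within normal variation
print(is_anomalous(history, 0.090))  # far above the usual range
```

A check like this only says that something looks off; deciding where to look next is the detective work.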

But beyond exposing new-normal application weirdness and “unknown-unknowns”, modern tools also encourage and stimulate more of the SRE detective work, which is the really valuable stuff. These tools won’t just detect anomalies and then leave teams scrambling to find the needle in a haystack. No sir: they’ll analytically gather all the evidence and lead teams in a fact-based fashion toward a solution. For example, an SRE-inspired monitoring service can detect a performance anomaly introduced with a new software build and then trace it to the actual code causing the problem.
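
As a rough illustration of the first half of that detective work, narrowing a performance anomaly down to the build that introduced it, the sketch below groups latency samples by build and flags any build whose median latency jumps well above the previous one. The build identifiers, the latency values and the 25% tolerance are assumptions made up for this example; tracing onward to the offending code is what real tracing and profiling tools add, and is not attempted here.

```python
from statistics import median


def find_regressed_builds(samples, tolerance=1.25):
    """Given (build_id, latency_ms) samples in deployment order, return the
    builds whose median latency exceeds the previous build's median by more
    than the `tolerance` factor."""
    by_build = {}
    for build, latency in samples:
        by_build.setdefault(build, []).append(latency)

    builds = list(by_build)  # dicts preserve insertion (deployment) order
    regressed = []
    for previous, current in zip(builds, builds[1:]):
        if median(by_build[current]) > tolerance * median(by_build[previous]):
            regressed.append(current)
    return regressed


# Hypothetical latency samples tagged with the build that served them.
samples = [
    ("v1.4.0", 120), ("v1.4.0", 130), ("v1.4.0", 125),
    ("v1.4.1", 128), ("v1.4.1", 122),
    ("v1.4.2", 210), ("v1.4.2", 190), ("v1.4.2", 205),
]

print(find_regressed_builds(samples))
```

Using the median rather than the mean keeps a single slow request from blaming an otherwise healthy build.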

Like Potter, operations professionals might have a hard time accepting they’re wizards. But ask yourself this: Do you want to remain a silly muggle getting burnt out by constant firefighting? Of course not; it’s career-limiting and it sucks. It is time, then, for some SRE magic: gaining the skills and tools needed to adopt new tech like containers and microservices, and becoming an essential part of future-proofing your business.

You’re a wizard, right?

By Peter Waterhouse | November 13, 2017