Site Reliability Engineer – the IT guardian you should’ve heard of by now

Why SREs are protectors of the user experience

Being a site reliability engineer isn’t easy.

As described by Andrew Widdowson, “it’s like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100mph.”

Known as the “automators”, SREs are often asked to observe application environments and manage incidents… at all hours of the day. Because, as everyone knows, when your app is down, so is your business.

The SRE’s job is to secure a flawless user experience. To deliver site reliability. SREs bridge Dev and Ops, ensuring new releases improve the product rather than break it.

The Challenge

The trouble with monitoring application environments is that there are hundreds of thousands of monitoring data points. How do you decide which data points are useful and which can be ignored? Alarm storms aren’t helpful. They prompt panic instead of resolution.

…And when a crucial incident does occur, how do you quickly mitigate it? The common SRE approach is to spend a ton of time and energy manually sifting through data – often at the expense of other initiatives, or worse, personal time (e.g. responding to the dinner-time incident alert).

What if you could get to that Aha! moment faster? What if instead of the typical hair-on-fire response, you had a trusted guide that could quickly lead you to the source of the incident?

If these symptoms describe your SRE team, or you’re a site reliability engineer who is anxiously awaiting a fix to this problem, check out kaizenOps.io.


Kyle Curry is a product marketing manager at CA Technologies, supporting the project teams in…
