Site Reliability Engineer – the IT guardian you should’ve heard of by now
Why SREs are protectors of the user experience
As Andrew Widdowson describes the job, “it’s like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100mph.”
Known as the “automators”, SREs are often asked to observe application environments and manage incidents… at all hours of the day. Because, as everyone knows, when your app is down, so is your business.
The SRE’s job is to secure a flawless user experience. To deliver site reliability. SREs bridge Dev and Ops, ensuring that new releases improve the product rather than break it.
The trouble with monitoring application environments is that they generate hundreds of thousands of data points. How do you decide which data points are useful and which can be ignored? Alarm storms aren’t helpful. They prompt panic instead of resolution.
And when a critical incident does occur, how do you quickly mitigate it? The common SRE approach is to spend a ton of time and energy manually sifting through data, often at the expense of other initiatives or, worse, personal time (e.g., responding to the dinner-time incident alert).
What if you could get to that Aha! moment faster? What if instead of the typical hair-on-fire response, you had a trusted guide that could quickly lead you to the source of the incident?
If these symptoms describe your SRE team, or you’re a site reliability engineer who is anxiously awaiting a fix to this problem, check out kaizenOps.io.