Alerts – When Actions Speak Louder than Words (on Consoles)

October 11, 2017

 

Are you spending lots of time responding to false alerts and noise? If you are, then it’s no wonder a career in IT Ops is often considered a hard slog. It’s like sitting through a long and boring PowerPoint presentation: you know there’s some nugget of useful information, but finding it buried in a stream of wordy slides is next to impossible.

Beware the Operational Dead Zone

It’s a quirk of nature that we humanoids tend to zone out when things get dull and repetitive. Think about the last time you took a long drive. Like me, have you ever jolted back into the moment and wondered what the heck you’d been doing for the last twenty minutes? Driving, of course, but how much can you recall?

It’s the same in IT Ops, where we can lose any sense of urgency and zone out. So how often do we game the system to make life tolerable? Perhaps by hacking up some automation that rejigs an alert threshold. That might sound like a fair cheat, but it’s not exactly foolproof, right?

Avoiding Unnecessary Sleep Deprivation with Application Performance Management

Having to address repeat problems at 3:00am sucks. No problem, I hear you say: just caffeinate quickly and kick off a handy script that kills a few processes and reboots a suspect server. Then back to bed and forget about it until your next on-call rotation. But even if we fully document our efforts and update the runbook, do we really have a permanent solution? Of course not. We’ve just contributed to the problem by applying a band-aid fix, with applications limping from one problem to the next.

But faced with the stresses of modern IT operations, it’s perhaps understandable that folks often cut corners. There just never seems to be enough time in the day to work up permanent solutions. That’s yet another reason why IT Ops needs solutions that replace unlimited servings of noise, distraction and repeat problems with an alerting mechanism that’s both intelligent and actionable. That’s exactly how we’ve designed and engineered CA Application Performance Management (CA APM), incorporating modern techniques that help end alert fatigue and reduce on-call costs. This includes:

  • Monitoring Alerts for users and their experience

Do your customers really care about a Java memory leak, with garbage collection failing to reclaim unused objects? Nope, they care about loss of functionality and responsiveness. CA APM addresses this by monitoring from the end-user perspective, keeping watch over symptoms that potentially impact business outcomes. It goes further by gathering evidence and automating detailed transaction traces and workflows.

  • Bringing teams and their knowledge together

How often have you been unsure about the impact of alerts on end-users? Consider the case where an alert indicates increased latency for HTTP traffic. Your trusty monitoring tools can identify the cause of the problem, but they can’t tell you which services fail the customer experience test. So as you rub your eyes and pull up the pajamas, are your efforts worth the pain? Perhaps, perhaps not. You’ll only know when data is combined and correlated. It’s an issue we considered when designing CA APM’s data model and agent technology, which allow infrastructure metrics to be layered in context onto existing application topology maps. The net-net: actions driven by customer-impacting conditions, with unified visibility for faster problem resolution.

  • Seeing the wood for the trees

In the past, monolithic apps meant teams could generally account for most conditions with simplistic alert rules. But with microservices, it’s impossible to assess the performance of a system as a single component. That’s why CA APM Differential Analysis learns application behavior and identifies uncontrolled variance in response times across groups of apps and microservices. By distinguishing serious problems from nuisance alerts, it ensures teams only get alerted when necessary. You can check out a neat write-up and cool video put together by my colleague David Martin, here.
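To make the idea of learning normal behavior a little more concrete, here is a minimal Python sketch of variance-based alerting. It is not CA APM’s Differential Analysis algorithm, whose internals aren’t described here. The function name, the 20-sample baseline window, the three-sigma band and the three-in-a-row rule are all assumptions chosen for illustration. The sketch learns a baseline from recent healthy response times and alerts only when a response time stays above the learned band for several consecutive samples, so a single blip doesn’t wake anyone up.

```python
from collections import deque
from statistics import mean, stdev


def sustained_variance_alerts(samples, window=20, sigmas=3.0, min_consecutive=3):
    """Return the indices at which a sustained deviation is confirmed.

    Hypothetical illustration only: all parameters are assumed defaults,
    not values taken from any product.
    """
    baseline = deque(maxlen=window)  # most recent "normal" response times
    alerts = []
    streak = 0  # consecutive samples above the learned band

    for i, value in enumerate(samples):
        if len(baseline) < window:
            baseline.append(value)  # still learning what normal looks like
            continue

        upper = mean(baseline) + sigmas * stdev(baseline)
        if value > upper:
            streak += 1
            # Alert once, when the deviation has persisted long enough.
            if streak == min_consecutive:
                alerts.append(i)
        else:
            streak = 0
            # Only healthy samples update the baseline, so an ongoing
            # incident cannot drag the definition of "normal" upward.
            baseline.append(value)

    return alerts
```

With a static threshold, the choice is between paging on every spike and setting the limit so high that real slowdowns slip through. A learned band with a persistence rule is one simple way to trade a little detection delay for far fewer nuisance alerts.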

 

On-call support is an important and necessary part of IT operations, but it doesn’t have to suck. When you and your colleagues are constantly chasing alerts and false alarms, morale and productivity will suffer. So it’s time to get busy with CA Application Performance Management, using techniques that help align your pain to business pain, coordinate responses and eliminate noise. That way people can learn and develop, gain a healthy work-life balance, and even get a decent night’s sleep.