Preventing the Next “Silent Hill” Horror with a New Model for APM

April 23, 2018

 

As an Australian, I was interested to learn that we antipodeans hold the record for the longest-burning underground coal fire. Yep, the aptly named Burning Mountain has been ablaze for – wait for it – 6,000 years.

As it turns out, coal seam fires are incredibly common, with dire consequences for folks living in close proximity. In 2014, a fire 1,000 miles southwest of Burning Mountain spewed toxic gases on the unfortunate townsfolk of Morwell for 45 days, while the one raging under Centralia, Pennsylvania in the US since 1962 has left the town all but abandoned – and served as inspiration for the horror movie Silent Hill.

Many of us in tech have our own Silent Hill-type systems: creaking technologies underpinning our customer-facing software applications. Like coal fires, they wreak havoc on our digital strategies and the people needed to support them, not least because folks spend most of their time fighting fires in toxic alert mode.

Underground coal fires are tough suckers to control, but it doesn’t have to be this way in business technology. With advances in instrumentation and monitoring, every element contributing to performance across the tech stack (app to infrastructure; on-premises to cloud) should be made visible and managed in the context of the business services it supports.

This all sounds Captain Obvious, but in practice it’s hard to achieve.

Traditionally, IT operations teams have been organized in strata: one team to support each layer – the app, infrastructure, network and so on. Monitoring is aligned accordingly, with discrete tools dedicated to each layer of the stack. And the more components we add, the more tools we acquire – each adding to the seams of siloed data.

But when fires break out, these data seams can exacerbate the problem. Teams monitoring one layer of the stack may use metrics that are unarguable (in their eyes and within their domain) in an attempt to deflect the issue. This becomes even worse when teams mask any ills with vanity metrics – for example, “OK, our customers can’t place orders, but our availability SLA is humming at 99.99%.”

In situations like these, it’s not uncommon for data disputes and turf wars to rage as intensely as the fires burning below. Not surprisingly, folks get burnt out, morale plummets, and digital business, well, goes up in flames.

What we have, then, is a data problem, which is kind of tragic given the glut of metrics, logs and transactions under our control. We need better ways to discover, model and visualize this data so that every team (irrespective of tech affinity) can collectively identify business-impacting hot spots. So if there’s any shaming to be done, it’s directed not at teams or colleagues, but at “biztech flakiness” – any Silent Hill apps, cloud services, code, methods, and infrastructure – whatever is burning.

At CA Technologies, we don’t want you to be stuck permanently in fire-fighting mode because that just sucks. With CA APM we’ve delivered assisted triage to quickly pinpoint complex performance problems, but we also recognize that great functionality such as this must be built upon a dynamic topological data model that correlates app to infrastructure, providing clear and uninterrupted insight into any anomalies and issues that can impact application performance. This is evident in our new layer views functionality, which allows teams to quickly visualize and traverse application and infrastructure topologies – all from a single interface. This way, any team or engineer gains the context needed to identify and attack problems more purposefully. So if your gig is infrastructure, no problem – you can immediately determine how that high CPU utilization across a container cluster is impacting latency and response times for a critical customer-facing web or mobile application.
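To make the idea of traversing a topology a little more concrete, here is a minimal Python sketch. It is not CA APM's data model or API; the component names, the `DEPENDENTS` map and the `impacted_apps` function are all hypothetical, invented purely for illustration. It assumes the topology can be expressed as "who depends on whom" and walks outward from a troubled infrastructure node to find the customer-facing apps that sit on top of it.

```python
from collections import deque

# Hypothetical topology: each key maps a component to the components
# that depend on it (infrastructure at the bottom, apps at the top).
DEPENDENTS = {
    "container-cluster-a": ["orders-service", "inventory-service"],
    "orders-service": ["web-storefront", "mobile-app"],
    "inventory-service": ["web-storefront"],
    "web-storefront": [],
    "mobile-app": [],
}

# Hypothetical set of components that customers touch directly.
CUSTOMER_FACING = {"web-storefront", "mobile-app"}


def impacted_apps(start, dependents=DEPENDENTS, customer_facing=CUSTOMER_FACING):
    """Breadth-first walk from `start` through everything that depends on it,
    returning the customer-facing apps reached, sorted by name."""
    seen = {start}
    queue = deque([start])
    hits = []
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
                if parent in customer_facing:
                    hits.append(parent)
    return sorted(hits)


if __name__ == "__main__":
    # Which customer-facing apps sit on top of the busy cluster?
    print(impacted_apps("container-cluster-a"))
```

With the sample data above, starting from `container-cluster-a` reaches both `mobile-app` and `web-storefront`, which is the kind of context an infrastructure engineer needs before deciding what to fix first.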

But beyond immediate problem solving, the dynamic data model supporting CA APM allows DevOps teams to conduct and deliver a far richer set of far-sighted services. It’s these that can help predict and prevent those uncontrollable “Silent Hill” type fires.

For example:

  • Analyzing code and its impact on application performance, so teams can collectively determine which designs, practices and methods correlate to the best business outcomes.
  • Determining unequivocally which infrastructure elements (cloud or on-premises) repeatedly cause application performance problems, so IT operations have a reliable decision-making framework to guide refresh or cloud migration strategies – in real time.
  • Building end-to-end application performance visibility into the delivery pipeline itself, so teams can improve the efficacy of their continuous testing practices – before production.
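The second point above, spotting infrastructure that keeps causing trouble, can be sketched in a few lines of Python. This is an illustration only, under the assumption that incidents have already been correlated to a root-cause element; the `INCIDENTS` records, the element names and the `repeat_offenders` function are hypothetical and do not come from any CA product.

```python
from collections import Counter

# Hypothetical incident records: (affected application, root-cause element).
INCIDENTS = [
    ("web-storefront", "vm-legacy-07"),
    ("mobile-app", "vm-legacy-07"),
    ("web-storefront", "cloud-db-eu"),
    ("web-storefront", "vm-legacy-07"),
]


def repeat_offenders(incidents, min_count=2):
    """Count incidents per root-cause element and return those that appear
    at least `min_count` times, most frequent first."""
    counts = Counter(root for _app, root in incidents)
    return [(root, n) for root, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    for element, count in repeat_offenders(INCIDENTS):
        print(f"{element}: {count} incidents")
```

On the sample records, only `vm-legacy-07` clears the threshold, with three incidents. A ranking like this is one simple input to a refresh or migration decision; a real framework would also weigh business impact and cost.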

 

If you’re constantly fighting fires, something is burning below you – apps, code, infrastructure, whatever. But unlike tackling an out-of-control coal seam fire, you have the means to gain visibility, however complex the application stack. To prevent the business horror story equivalent of Silent Hill, seek out ways to visualize truths that have so far been hidden from view. That begins and ends with embracing a new model in Application Performance Management.