APM monitoring governance: The Jurassic Park conundrum
Just because we can monitor everything doesn’t mean that we should.
One of my favorite quotes comes from the 1993 hit movie Jurassic Park. I’m sure you remember it—Jeff Goldblum’s character, Ian Malcolm, objects to the park creator’s lack of respect for nature and to innovations the park’s genetic engineers have implemented. Ian says, “Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should.”
Today, many application performance management practitioners perpetuate the problem Ian defines, believing that simply because they can monitor something, they should. It’s not uncommon for organizations that monitor apps to measure every possible operation, so that they can determine the source of every conceivable problem.
Think of me as the Ian Malcolm of APM monitoring governance: Experience tells me that just because we can monitor everything doesn’t mean we should. Time and again, I’ve stepped in at organizations where overuse or misuse of CA APM or a similar tool had degraded the performance of both the infrastructure and the monitored apps. Ironically, the tooling meant to safeguard performance was itself the cause, through over-monitoring or over-instrumentation.
Extensions, power packs, add-ons, traces, management modules, metric groupings, alerts, calculators and dashboards are all useful in enabling proactive monitoring and troubleshooting, but over-indulging in them can bloat the agent footprint and consume valuable resources such as CPU, memory and disk. Attempting to analyze, prioritize and then act on too much performance data can also distract teams from the issues that matter.
Without limits on the metrics gathered (or, as Ian Malcolm might say, without any governance), the result is loss of control, although I’ll admit it’s less threatening than a herd of Tyrannosaurus Rexes on the loose. Application performance metrics should be associated with specific actions. When a metric crosses a defined threshold, application and monitoring teams then know exactly what to do, and can assess and remediate the issue with more focus and efficiency.
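To make the idea of tying each metric to a specific action concrete, here is a minimal sketch in Python. It is not CA APM configuration or any vendor's API; the metric names, thresholds and actions are all hypothetical, chosen only for illustration. It also flags any metric that is being collected without a rule attached, which is a reasonable candidate for switching off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    metric: str       # hypothetical metric name
    threshold: float  # value above which the team should act
    action: str       # the specific action tied to this metric


def evaluate(rules, samples):
    """Return (actions_to_take, ungoverned_metrics) for one batch of samples."""
    by_name = {rule.metric: rule for rule in rules}
    actions = []
    ungoverned = []
    for metric, value in samples.items():
        rule = by_name.get(metric)
        if rule is None:
            # Collected, but nobody has said what to do about it.
            ungoverned.append(metric)
        elif value > rule.threshold:
            actions.append(f"{metric}: {rule.action}")
    return actions, sorted(ungoverned)


# Illustrative rules and readings only; real values would come from your KPIs.
RULES = [
    MetricRule("checkout_avg_response_ms", 2000, "page the checkout on-call engineer"),
    MetricRule("jvm_heap_used_pct", 90, "capture a heap dump and restart the node"),
]

SAMPLES = {
    "checkout_avg_response_ms": 3150.0,
    "jvm_heap_used_pct": 72.0,
    "thread_pool_idle_count": 14.0,
}

actions, ungoverned = evaluate(RULES, SAMPLES)
print("Act on:", actions)
print("No rule attached:", ungoverned)
```

In this sketch only the checkout response time triggers an action, and the idle thread count shows up as a metric with no rule behind it. A real deployment would express the same mapping in whatever alerting mechanism the APM tool provides.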
That’s why monitoring governance is so essential. To harness the power of a tool such as CA APM, without misapplying that power, organizations need to determine which operations are most important to their goals. Governance provides guidance and control by applying a model of logical thought processes and experience to determine what should be monitored. Defining key performance indicators (KPIs) helps an organization determine which metrics best reveal how an app is performing and expose problems such as slowness, lack of response, bottlenecks, and resource exhaustion. It is those metrics that should be monitored.
So if your organization needs a robust set of proactive monitors tuned to deliver timely, actionable performance information, rather than a cacophony of monitors gathering irrelevant data points and adding overhead that degrades infrastructure and application performance (and who needs that, right?), you need to ensure that monitoring governance is in place.
The solution is to plan ahead by identifying the KPIs your monitoring should track. Work with your development teams, who know their apps, and with APM experts who have the skills and, more importantly, the experience in monitoring apps and configuring tools. Together they help the organization apply good judgment and make good decisions. (One caveat: Developers like to turn every monitor on in a test environment, so make sure they turn off those that aren’t crucial in production.)
Another key component in decision-making is input from the organization’s leadership team. Involving leaders in the process may surface additional reliable metrics that give the best possible indication of app performance. You need to know which metrics are most important to them, and then measure only the subset that someone can actually correct when an alert signals poor performance. There’s no point in measuring something that isn’t worth fixing or whose alerts are continually ignored when it goes out of range.
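The selection rule described above can be sketched as a simple filter. This is an illustrative Python example under stated assumptions, not a feature of any APM product: the `CandidateMetric` fields and the sample metric names are hypothetical, and the three criteria (leadership cares about it, it can be corrected, someone owns the response) are one reading of the paragraph above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CandidateMetric:
    name: str                      # hypothetical metric name
    important_to_leadership: bool  # leadership named it as a priority
    correctable: bool              # the organization can fix what it reports
    owner: Optional[str]           # who responds to the alert, if anyone


def select_for_production(candidates):
    """Keep only metrics that matter, can be fixed, and have an owner."""
    return [
        c.name
        for c in candidates
        if c.important_to_leadership and c.correctable and c.owner
    ]


# Illustrative candidates only.
CANDIDATES = [
    CandidateMetric("login_error_rate", True, True, "identity team"),
    CandidateMetric("third_party_ad_latency_ms", True, False, None),
    CandidateMetric("debug_method_call_count", False, True, "dev team"),
]

selected = select_for_production(CANDIDATES)
print(selected)
```

Here only `login_error_rate` survives. The ad latency metric is dropped because nobody in the organization can correct it, and the debug counter is dropped because it is not a priority, which is the kind of monitor that belongs in test rather than production.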
The right combination of monitoring artifacts, in the proper quantity, will provide accurate and timely performance data that alerts the organization when it needs to act to prevent catastrophic failure. That’s the solution to the Jurassic Park conundrum.