Friday, February 18, 2011

Incident Tracking and APM Maturity

I've been doing more Architecture Assessments (post-deployment) than Planning Assessments lately and have noticed something troubling. Folks in general have mature incident and trouble management practices for the applications that they operate/manage. Yet they do not track any incidents related to the APM solution itself. What's going on?

I believe this is due to the common misconception that APM is 'just another monitoring tool'. Some folks think APM looks like many other tools - just with better capabilities. And they treat it just the same - a tool for operations to use when there is a problem, and back on the shelf when things are quiet. This ignores the 24x7 reality of the solution. We already know that this presumption leaves significant gaps in managing the capacity of the metrics storage component.

A small APM initiative can go for years before it runs out of capacity, and no one manages that capacity until the solution becomes unstable - and then the team realizes how limited its understanding of the solution is. A growing APM initiative will run into this problem more quickly, depending on the pace of its successive deployments. But the pace of deployment is not as significant as the absence or presence of incident tracking for the APM solution.

In the absence of incident tracking, the client carries on blindly, experiencing instability of the monitoring solution, and then escalates to vendor support, initially labeling everything as a 'product defect'. Vendor support will then attempt to confirm the 'defect', but after finding nothing wrong (no known incidents, no prior history of instability), you end up with a bit of an impasse: no defect, yet no resolution, because the problem is the configuration, not the monitoring software.

Why then is 'incident tracking' so magical? Stability problems don't suddenly happen. They are often the result of a long, slow grind to the point where stability (or performance) is unacceptable. Something happens. No one is quite sure. Reboot a few things - and the problem seems solved! A couple of weeks later, the reboots are more frequent. After a couple of months, the reboots don't seem to have any lasting effect - and a support incident gets opened. Incident tracking captures these seemingly unrelated events and generates a larger perspective. You still may not know what to do but you can see where it is trending. You may start to look for other correlations. You end up with a much better history and timeline of when and how things started going bad - and that will be a big help once it becomes a support incident.

It also makes it easy to confirm the fix is helping, or not, as the frequency of incidents changes.
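
To make the idea concrete, here is a minimal sketch of what even the simplest incident log buys you. It assumes nothing more than a list of dates on which someone had to restart part of the APM solution. The dates, the `fix_applied` date, and the helper names `weekly_counts` and `rate_per_week` are all made up for illustration and are not taken from any client or product.

```python
from collections import Counter
from datetime import date, timedelta


def weekly_counts(incident_dates):
    """Count incidents per week, keyed by the Monday that starts each week."""
    counts = Counter()
    for d in incident_dates:
        week_start = d - timedelta(days=d.weekday())
        counts[week_start] += 1
    return dict(sorted(counts.items()))


def rate_per_week(incident_dates, start, end):
    """Average incidents per week within the half-open window [start, end)."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be after start")
    n = sum(1 for d in incident_dates if start <= d < end)
    return n / (days / 7)


# Hypothetical log: each entry is a day the APM solution needed a restart.
restarts = [
    date(2011, 1, 3),
    date(2011, 1, 17),
    date(2011, 1, 26),
    date(2011, 1, 31),
    date(2011, 2, 2),
    date(2011, 2, 4),
    date(2011, 2, 16),
]

# Hypothetical date on which a configuration fix was applied.
fix_applied = date(2011, 2, 7)

# The per-week view shows the reboots becoming more frequent.
for week_start, count in weekly_counts(restarts).items():
    print(week_start.isoformat(), "#" * count)

# Comparing the rate before and after the fix shows whether it is helping.
before = rate_per_week(restarts, date(2011, 1, 3), fix_applied)
after = rate_per_week(restarts, fix_applied, date(2011, 2, 21))
print(f"before fix: {before:.1f}/week, after fix: {after:.1f}/week")
```

With this made-up data, the week of January 31 stands out with three restarts, and the rate drops from 1.2 per week before the fix to 0.5 per week after it. A real incident record would also carry the symptom and the action taken, but even bare dates are enough to show the trend.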

As to the nature of the instability, and the effort to remediate - these are turning out to be real systemic problems - so no easy fixes. Getting incident tracking re-established for the APM solution - that's easy enough - but the damage has already been done.

How do you track the performance and capacity of your APM solution? When do you think it will "matter"?
