Monday, November 30, 2009

The Essence of Triage

[clearing out the drafts folder - Nov 30]

When interpreting performance data for an incident, the question often arises as to what we should look at first. For my APM practice I always focus on "what changed", and this is easily assessed by comparing against the performance baseline or signature for the affected application.

But folks new to APM, often very much focused on jumping in to look at individual metrics, can easily get confused because so many metrics will look suspicious. A number of attributes of the application - response times, invocation rates, garbage collection, CPU - will be outside of normal. And folks will bias their recommendations as to which avenue to explore based on the experience they have with a particular set of metrics.

My approach to this is pretty basic: go for the "loudest" characteristic first. Much like "the squeaky wheel gets the oil" - the "loudest" metric is where you should start your investigation, moving downhill from there. More importantly, you need to stop looking once you have found a defect or deviation from your expectations, get it fixed, and then look again for the "loudest" metric.

This is important because the application will re-balance after the first problem is addressed, and you will get a new hierarchy of "loud" metrics to consider.
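To make that concrete, here is a minimal sketch (Python, with made-up metric names and numbers - not any particular APM product's API) of what "loudest first" can look like once you have a baseline to compare against: rank each metric by how far it sits from normal, investigate the top one, fix it, and then rank again.

# A minimal sketch of "loudest metric first" triage, assuming we already have a
# baseline (mean and standard deviation per metric) and a current snapshot.
# All metric names and numbers are hypothetical.

baseline = {
    # metric: (normal mean, normal standard deviation)
    "front_end_response_sec": (1.5, 0.3),
    "query_response_sec":     (0.4, 0.1),
    "invocations_per_min":    (600, 50),
    "gc_time_pct":            (2.0, 1.0),
}

current = {
    "front_end_response_sec": 2.1,
    "query_response_sec":     1.2,
    "invocations_per_min":    580,
    "gc_time_pct":            3.0,
}

def loudness(metric):
    """How many standard deviations the current value sits from its baseline."""
    normal_mean, normal_std = baseline[metric]
    return abs(current[metric] - normal_mean) / normal_std

# Rank metrics by deviation from normal; start the investigation at the top.
for metric in sorted(current, key=loudness, reverse=True):
    print(f"{metric}: {loudness(metric):.1f} std devs from baseline")

# Fix the loudest problem, re-collect the current snapshot, and rank again:
# the hierarchy shifts once the application re-balances.

Note that in this toy snapshot the query time "wins" even though the front-end response time is larger in absolute terms - it is the deviation from normal, not the raw number, that makes a metric loud.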

For example, let's consider a real-world scenario where there are two servlets, each of which is accessing a separate database. Use Case A accesses Servlet A, which queries an RDBMS with a query response time of 4 seconds; Servlet A itself has a response time of 5 seconds. Use Case B has a response time of 2 seconds, accessing Servlet B, which makes a query via messaging middleware to a mainframe HFS that takes 1 second. Which of these is the loudest problem?

If you feel that a servlet response time of 5 seconds is a pretty good clue, you would be wrong. Sure, everyone should know that a servlet response time should be on the order of 1-3 seconds, and being able to compare this performance to an established baseline would certainly confirm it.

Instead, we will limit our consideration to the Use Case that actually has users complaining, which in this case is Use Case B.

"Wait a minute! You didn't tell us which use case had users complaining!"

Right. And neither will your users (real-world scenario)! What I'm driving at is that you can't know where to look until you know what has changed. And you can't know what has changed unless you have a normal baseline with which to compare. It's always nice when you have an alert or user complaint to help point you in the right direction but that can be unreliable as well.

For all this I prefer what I call the "Dr. House" model. House is a TV show in which Dr. House draws out the root cause of the troublesome medical cases he and his team take on. I think it's a great model for triage and diagnosis of application performance. One of the axioms of House's interactions with patients is that "everybody lies" when they present their medical history.

This is how I approach triage - everybody lies (or, in corporate-neutral language, inadvertently withholds key information). So I base my conclusions as to how to proceed and what to look for on what I can expose by comparing the current activity to "normal" - or whatever baseline I can develop.
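And a baseline doesn't have to be elaborate. The sketch below (again Python, with hypothetical metrics and a hand-rolled sample list standing in for whatever monitoring history you actually have) shows one simple way to develop a signature: record the normal mean and variation per metric, per hour of day, and flag current activity that falls well outside it.

# A sketch of one way a performance baseline (or "signature") might be built:
# mean and standard deviation per metric, per hour of day, from history.
# The metric names, numbers, and data source are hypothetical.

from collections import defaultdict
from statistics import mean, pstdev

# samples: (hour_of_day, metric_name, value) tuples gathered from monitoring.
samples = [
    (9, "servlet_b_response_sec", 0.7),
    (9, "servlet_b_response_sec", 0.9),
    (9, "servlet_b_response_sec", 0.8),
    (14, "servlet_b_response_sec", 1.1),
    (14, "servlet_b_response_sec", 1.0),
]

grouped = defaultdict(list)
for hour, metric, value in samples:
    grouped[(hour, metric)].append(value)

# The baseline: what "normal" looks like for each metric at each hour.
baseline = {
    key: (mean(values), pstdev(values)) for key, values in grouped.items()
}

# Current activity is then judged against the signature, not against gut feel.
hour, metric, observed = 9, "servlet_b_response_sec", 2.0
normal_mean, normal_std = baseline[(hour, metric)]
if normal_std and abs(observed - normal_mean) > 3 * normal_std:
    print(f"{metric} at hour {hour} deviates from its signature "
          f"({observed} vs normal {normal_mean:.2f})")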
