Monday, November 30, 2009

The Essence of Triage

[clearing out the drafts folder - Nov 30]

When interpreting performance data for an incident, the question often arises: what should we look at first? For my APM practice I always focus on "what changed", and this is easily assessed by comparing against the performance baseline, or signature, for the affected application.

But folks new to APM, often very much focused on jumping in to look at individual metrics, can easily get confused because so many metrics will look suspicious. A number of application attributes (response times, invocation rates, garbage collection, CPU) will be outside of normal, and folks will bias their recommendations as to which avenue to explore based on their experience with a particular set of metrics.

My approach to this is pretty basic: go for the "loudest" characteristic first. Much like "the squeaky wheel gets the oil", the "loudest" metric is where you should start your investigation, moving downhill from there. More importantly, you need to stop looking once you have found a defect or deviation from your expectations, get it fixed, and then look again for the "loudest" metric.

This is important because the application will re-balance after the first problem is addressed, and you will get a new hierarchy of "loud" metrics to consider.

For example, let's consider a real-world scenario with two servlets, each accessing a separate data source. Use Case A invokes Servlet A, which queries an RDBMS with a query response time of 4 seconds; Servlet A's overall response time is 5 seconds. Use Case B has a response time of 2 seconds; it invokes Servlet B, which makes a query via messaging middleware to a mainframe HFS, and that query takes 1 second. Which of these is the loudest problem?

If you feel that a servlet response time of 5 seconds is a pretty good clue, you would be wrong. Sure, everyone knows that a servlet response time should be on the order of 1-3 seconds, and comparing this performance to an established baseline would seem to confirm the suspicion.

Instead, we will limit our consideration to the use case that actually has users complaining, which in this case is Use Case B.

"Wait a minute! You didn't tell us which use case had users complaining!"

Right. And neither will your users (real-world scenario)! What I'm driving at is that you can't know where to look until you know what has changed, and you can't know what has changed unless you have a normal baseline with which to compare. It's always nice when you have an alert or user complaint to help point you in the right direction, but that can be unreliable as well.

For all this I prefer what I call the "Dr. House" model. House is the TV doctor who draws out the root cause of the troublesome medical cases he and his team take on. I think it's a great model for triage and diagnosis of application performance. One of the axioms of House's interactions with patients is that "everybody lies" when they present their medical history.

This is how I approach triage: everybody lies (or in corporate-neutral language: inadvertently withholds key information). So I base my conclusions about how to proceed, and what to look for, on what I can expose by comparing the current activity to "normal", or whatever baseline I can develop.
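
To make the "loudest first" idea concrete, here is a minimal sketch in Java. The baseline numbers are invented for illustration (the scenario above deliberately never gives them), and the metric names are hypothetical; the point is that you rank by deviation from baseline, not by absolute value.

    import java.util.Comparator;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal sketch of "loudest metric first" triage.
    // All names and baseline values are hypothetical.
    public class LoudestMetricTriage {

        // "Loudness" = current value relative to its baseline; 1.0 is normal.
        static double loudness(double baseline, double current) {
            return current / baseline;
        }

        public static void main(String[] args) {
            // Values are {baseline, current}, in seconds. Servlet A looks
            // worse in absolute terms, but it is at its normal baseline;
            // Use Case B is the one that has actually changed.
            Map<String, double[]> metrics = new LinkedHashMap<>();
            metrics.put("UseCaseA.ServletA.response",   new double[]{5.0, 5.0});
            metrics.put("UseCaseA.ServletA.rdbmsQuery", new double[]{4.0, 4.0});
            metrics.put("UseCaseB.ServletB.response",   new double[]{0.5, 2.0});
            metrics.put("UseCaseB.mqToMainframe",       new double[]{0.2, 1.0});

            metrics.entrySet().stream()
                   .sorted(Comparator.comparingDouble((Map.Entry<String, double[]> e)
                           -> loudness(e.getValue()[0], e.getValue()[1])).reversed())
                   .forEach(e -> System.out.printf("%-30s %.1fx baseline%n",
                           e.getKey(), loudness(e.getValue()[0], e.getValue()[1])));

            // Investigate only the top entry, fix it, then re-collect and
            // re-rank: the application re-balances and the hierarchy changes.
        }
    }

Run against numbers like these, the Use Case B metrics surface at 4x-5x their baselines, while Servlet A's scary-looking 5 seconds ranks dead last because it is running at its normal signature.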

Wednesday, November 25, 2009

Yikes - project crashing!

My executive sponsor has bailed on the project. No details yet as to why. I am bummed. Sure, I just need to sign up a fresh one, but time marches on...

Tuesday, November 24, 2009

What You Should Know... part 2

Well, this part was less annoying.
  • Application lifecycle - includes development. Tru-dat my brother! Too bad the author's example is a code profiler. And yes, most APM-savvy folks do not include development as part of APMonitoring. But if you really want to improve app performance (and the end-user experience), you need cooperation from development.
  • The SOA flag - This is not so bad. He inserts that ASM acronym again, but otherwise this is accurate and helpful. You know, if you really think that APM is overloaded (which it is), how about SPM (Service Performance Management)? Everybody knows "ASM" means an assembler directive anyway!
And then a few other points - and the rest is an ad for a Microsoft product. And now all the spin makes sense: a desktop view of the IT management problem. Not so innovative. Read at your own risk.

All in all, what can I say about this author? He has 30 books published. I have a book proposal. So I am crap.

But. He writes about APM. I "do" APM. I help clients realize APM. I "talk the talk" AND "walk the walk".

The author is "two-thousand and LATE"! ;-)

Now, if I only had a book...

What You Should Know About Application Performance Management

This is from RealTime Nexus - The Digital Library for IT Professionals. Do your own search - I will not spoil my page with a link to it! Let me opine on the salient points:
  • Makes a distinction for APMonitoring (APM == Monitoring) and somehow reserves the assignment of thresholds to Health Monitoring, as different from Performance Monitoring, even suggesting "AHM" as a new acronym... before conceding that APM is already well accepted. I think it would be more accurate to acknowledge that APM tools don't do "Health Monitoring" out of the box (OOTB), but I would submit that APM processes address this, especially as I already use a "HealthCheck" process as part of the APManagement best practices.
  • Asserts that "end-user performance" is the primary metric for APM and acknowledges that the "other metric may be involved... to provide troubleshooting details." This is too much of a plug for a specific vendor solution. Sure, end-user experience is part of what performance management is all about, but it is a little naive to assert that it is the main focus. There is a huge benefit in using APM across the application lifecycle, especially before the end-user experience can even be measured (development, testing). Not to mention the significant number of applications that do not even have a user front-end to measure!
To preserve vendor neutrality, I instead prefer to focus on the "business transaction": the end-to-end flow across all participating components. This puts the end-user piece in the proper perspective: simple apps are predominantly end-user-centric; enterprise apps are not.

In assessing the utility of an APM initiative, the focus is always on the high-value transactions, end-user-related or not. Then, when you know what really matters, you select the appropriate technology. That way, you don't end up trying to use end-user monitoring on a CICS transaction, nor byte-code instrumentation to monitor a print server. ;-)

  • How does APM work -- I was nestling in for a good read here but was disappointed. Regarding the types of information that an APM tool can take advantage of, the author describes the following:

"Application-level resource monitors, if any exist. These have to be specifically created and exposed by the application author."

I guess they never heard of BCI (Byte Code Instrumentation), which does this automatically. And they have impugned JMX and PMI technologies, which do the right job for configuration information of the application server - which is what I'm hoping the author really meant. JMX and PMI require the developer to code for their use. Always have, always will - and an expensive proposition at development time (a sketch of what that hand-coding looks like appears at the end of this post). BCI, by contrast, automatically determines what is interesting to monitor, much more effectively than JMX or PMI, and at runtime (aka late-binding). But if the data is already there, we take direct advantage of it.
  • Downsides of APM -- This is annoying because it is a grain of truth buried in a FoxNews-spun positioning. Sure, packaged apps are hard to monitor, but this is because they are closed and usually provide their own monitoring tools. Those may not be best of breed, but it's a start. Some packaged-app vendors will actually embed APM technologies within their offerings, and some require a vendor-specific extension. For sure, the industry is lagging a bit here, and that's always a problem with proprietary solutions. It is not an APM problem.
Then the author raises some FUD about virtualized environments... Does he know that an LPAR is a thirty-year-old virtualized environment? That running in a container, and making accurate performance measurements there, is old news? Methinks the vendor he is shilling for cares not for mainframe, or BCI, or managing the lifecycle. Or he is some sort of APM Luddite who hasn't looked around much since 2002.
  • Application Service Management (ASM) -- this is the point that set me off and motivated me to opine this detailed review. The author creates a new acronym - cool. I do it all the time. No crime here. But "ASM takes a more business-oriented view" - yikes! ASM focuses on the operating system and platform metrics... I thought that was NOT the business view. And then the author acknowledges that the APM/ASM differences are "semantic" and that you should never get caught up in that. Frankly, this is the same tack that creationists take when they assert that "... evolution is a theory that is still being considered...". Dude!!!
And now a link to two more parts... I am a sucker for pain! ;-)
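
Postscript: since I griped above about application-level monitors that "have to be specifically created and exposed by the application author", here is a minimal sketch of what that hand-coding looks like with JMX. The class and metric names are invented for illustration; this is the developer effort that BCI makes unnecessary.

    // CheckoutStatsMBean.java -- the management interface the developer must
    // define by hand (standard MBean naming: <Class>MBean, same package).
    public interface CheckoutStatsMBean {
        long getOrderCount();
        double getAvgMillis();
    }

    // CheckoutStats.java -- the implementation, which the developer must also
    // remember to register, and to call from the business logic.
    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class CheckoutStats implements CheckoutStatsMBean {
        private long orders;
        private long totalMillis;

        // Business code has to call this explicitly, or nothing is measured.
        public synchronized void record(long elapsedMillis) {
            orders++;
            totalMillis += elapsedMillis;
        }

        @Override public synchronized long getOrderCount() { return orders; }

        @Override public synchronized double getAvgMillis() {
            return orders == 0 ? 0.0 : (double) totalMillis / orders;
        }

        public static void main(String[] args) throws Exception {
            // Registration is yet another step the developer must not forget.
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            CheckoutStats stats = new CheckoutStats();
            server.registerMBean(stats,
                    new ObjectName("com.example:type=CheckoutStats"));
            stats.record(120); // simulate one instrumented call
            System.out.println("avg ms: " + stats.getAvgMillis());
        }
    }

Multiply that boilerplate by every component worth measuring and you can see why it rarely happens at development time, while BCI weaves the equivalent measurement in at class-load time with no application code at all.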

Wednesday, November 18, 2009

Legal approved the project, with caveats

With little fanfare, the first major hurdle has been cleared. The caveats were as expected: no discussion of product technology, no vendor or client names, and no vendor bias - what we call vendor-neutral. The last point has been the real 'chestnut' for the marketing folks on past projects. They want to control the product message and spin things in their direction. I've always been of the mind that the only real solution was to stay vendor- and technology-neutral, and I've been doing it this way for years.

I've been sticking to "vendor-neutral" as the dominant theme for the APM practice over the last (4) years, and it has clearly been the "path least traveled". When there is nothing to leverage, the marketing arm has generally been indifferent to the program, though sometimes it would be a little more pointed. The name of the program, for instance, went through a number of iterations, each about (9) months apart, each ending with a pronouncement and re-branding of all the materials - and then silence. And when we did get some marketing support for a brochure or web link, you really had to hunt it down.

Last year I had an MBO to develop sales training materials, get them recorded, and get them up on our internal training site. Everything was under "severe urgency", and when it was finished, no notice of the update was allowed and no edit to make it part of the presales learning plan - which is really the only way folks will spend a couple of hours doing training. Sure, it was a month late, but I had some family problems. So here it is a full year later, and folks are still surprised that they can find it.

So, just three days after the blessing by legal, I'm finding that the project PM is questioning the very premise of re-purposing the practice's root documents into book form, and that call will not happen until December 4th. This will suck!

Call me paranoid but I think the old product-centric nemesis is back again! - preserve the status quo, it's not our business anyway, it will only help the competition survive another year...

I did a best practice paper a couple of years back focusing on memory management, going well beyond the product doc to really show folks how to use it correctly and, more importantly, how to back out of problems when it was used incorrectly. After a year as a proposal - and a cherry of a project, from a technical perspective - I picked it up (the original lead had backed off the project) and finished in 3 months. Multiple reviewers, and probably my best work at the time. Then it went to marketing for approval to publish and sat there for (9) months. Then it was released, with no edits or commentary at all. My inside sources said the issue was that marketing could not accept that someone could misconfigure the product. Sounds plausible - but folks blow stuff up all the time, with any software system. It seems cruel not to show them how to back out gracefully - but that's just my opinion. Anyway, I never sent any more material up to marketing. And now that decision has come back to bite me: with 4 years of constant writing without any marketing oversight, will they actually review it? Shoot the project? Or throw up their hands and yield to the marketplace?

My original response to management, back in August when the book was 'commissioned', was that marketing would wake up and block a book, so why not drop the fame and glory and go with a Google Knol? We publish what we want (with a blessing by legal), the world is made a better place, and I can move on to the next big thing. I still think that is the best route, especially as my optimism wanes.

Maybe it will be better after some turkey?