Wednesday, December 16, 2009

Oblicore - Service Level Management

Notes from the public (registration required) online seminar, 16 December 2009, 11:00 AM EST:

Deming (father of quality control) - Plan, Do, Check (measure), Act

An SLA can mean many things - ITIL, ERP, CRM, COBiT - folks think they are talking about the same thing, but the meaning varies depending on where they are in the organization.

"When you have no destination - Every road leads nowhere". You need to align business needs with targets.

- An SLA is a contract between a provider and a customer. Service Level Agreement, Operational Level Agreement, Underpinning Contract - they all document targets and specify the responsibilities of the parties.

- SLM -- ITIL Service Level Management. Tells you what to do (but not necessarily how to do it).
- Define, document, agree, monitor, measure, report and review the level of IT services provided.

Most companies manage SLAs manually. Each contract negotiated separately. Labor intensive data collection. Reporting is reactive.

- Establish - Implement - Manage - Review:

Service Improvement Plan: Define Strategy, Planning, Implementation, Catalog Services, Draft, Negotiate, Review SLAs SCs and OLAs, Agree, measure and monitor, Report, Review SLM process, Review SLAs SCs and OLAs (back to start of loop).

( lots about roles for the above steps...)

- Standardizing Contract Creation and Revision - data collection and metrics need to be uniform.
-- Most SLAs require multiple data sources. You need to aggregate the data, apply exceptions (TOD/time-of-day windows, external events), and correlate with other incidents (see the sketch below).
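To make that concrete, here is a minimal sketch of that aggregation, in Python, assuming hypothetical data shapes (per-source availability samples plus agreed exception windows). It only illustrates the idea - it is not Oblicore's implementation:

    from datetime import datetime

    # Hypothetical shapes: each monitoring source yields (source, timestamp, available)
    # samples; exception windows are agreed periods (maintenance, external outages)
    # that the SLA excludes from the calculation.
    samples = [
        ("web_probe", datetime(2009, 12, 1, 2, 15), False),
        ("web_probe", datetime(2009, 12, 1, 9, 0), True),
        ("app_agent", datetime(2009, 12, 1, 9, 5), True),
    ]
    exception_windows = [
        # (start, end, reason) - e.g. the agreed 02:00-03:00 maintenance window
        (datetime(2009, 12, 1, 2, 0), datetime(2009, 12, 1, 3, 0), "maintenance"),
    ]

    def excluded(ts):
        """True if a sample falls inside an agreed exception window."""
        return any(start <= ts < end for start, end, _ in exception_windows)

    def sla_attainment(samples):
        """Aggregate samples from all sources, drop excluded periods,
        and return percent availability for the measurement period."""
        counted = [ok for _, ts, ok in samples if not excluded(ts)]
        return 100.0 * sum(counted) / len(counted) if counted else None

    print("Attainment: %.2f%%" % sla_attainment(samples))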

(more on types of pain when a robust process for SLA management is absent)

Oblicore does the last mile to drive an automated ITIL approach, via a double closed-loop process. (Geeze - this is buzzword heavy!).

(demo of Oblicore Guarantee) UI is browser-based, multiple tabs. Basically workflow management with a federated view of multiple metrics over various web-based forms, following the ITIL model. Generates paper! But somebody has to sign this stuff - so, way better than Excel... They use adapters to bring in various metrics. Looks like an ETL transform (table/field assignment). Don't know how real-time this could be.

Friday, December 4, 2009

Storm Clouds are Clearing

Well the word is that we are refitted with an appropriate executive sponsor and a status call is scheduled for Monday 7 December. I'm still expecting a few bumps in the road.

Review - Symantec I3 - A Performance Management Methodology

[Clearing out the Drafts folder (Dec 4)]
http://silos-connect.com/solutions/i3methodologybook.pdf

This has a 2005 copyright - so I'm not expecting to get anything too useful here. But it does show up when you are searching on "APM Methodologies".

Introduction - Poor app performance translates into poor end-user experience.

Chapter 1 - Defines incidents as "performance red zones" - where the system fails to meet performance goals - and suggests that these may also be used to define SLAs. They break incidents down into four major types, and I've added some more conventional definitions of what they had in mind:
  1. Design and Deployment Problems (Design, Code, Build, Test, Package, Deploy)
  2. System Change Problems (Configuration mistakes due to tuning or attempted fix)
  3. Environment Change Problems (unexpected load or usage patterns)
  4. Usage Problems (Prioritized Access to Shared IT Resources)
The summary of that section states:
"Performance management comprises three core activities, reactive, proactive and preventative. Symantec i3 provides the techniques and the tools that are necessary to carry out these activities using a structured, methodical, and holistic approach."
Accurate as those core activities are, there is nothing in chapter 1 that actually supports that summary.

Chapter 2 - provides the substance for the core activities described in the summary from chapter 1.
  • Reactive Management - maximize alertness and minimize time-to-detection of problems; equip staff with the right tools to minimize time-to-resolution. I guess lots of caffeine is one of the tools!
  • Proactive Management - "close monitoring" will often detect that problems are likely to occur. I guess you need to be in the same room as the monitoring...
  • Preventative Management - Minimize performance problems in the first place by employing periodic health checks and resolve problems before they get out of hand.
  • Hybrid Approach - Combine all (3) management styles, with feed-forward and feed-back, leading to an overall decline in firefighting situations.
I don't mind where they ended up (hybrid approach) but it seems very operations-centric, which is unusual because the tool itself is developer-focused - and I've never met anyone who could use i3 in operations. But phrases like "maximize alertness" and "close monitoring" suggest that the contract writer didn't know that much more about monitoring than the marketing dude who approved this whitepaper.

Reactive Management IS jumping to action when an incident occurs. But the call to action comes from an alert, and the problem with alerts is that they very often don't bubble up to a Trouble Management console until some 15-30 minutes after the incident started. So somehow being closer to the keyboard doesn't really add any value - you are still the last to know. When your helpdesk is a more timely indicator of application health than your monitoring solution, that's when you know you have a major visibility deficit.

Likewise, defining proactive as more frequent monitoring, in order to close that 15-30 minute gap, is also a disservice. The operations team is simply not in a position to be constantly reviewing the key metrics for the hundreds of applications under their watch. They rely on alerts to filter out the application noise (all those squiggly traces) and indicate which app is having problems. It might surprise you to also realize that operations has exactly two responses to an alert: 1. restart the app, or 2. open a bridge call, and restart the app. That's 95% of modern IT today, and that's the real problem.

I define proactive management as preventing the problem from getting deployed in the first place - not simply responding to an alert more quickly. The writer assigns this instead to preventative management, so maybe it is only a semantic (funny, no?) difference. But if they mean the periodic health checks as something that occurs in operations, then this is a fantasy. Reverse-engineering the normal performance characteristics from an operational application is a massive task. Remember, we are not talking about the home web server - we are talking about a commercial, enterprise environment with hundreds of applications. That's really not a role for operations to undertake. To make the problem tractable, the app owners have to pass judgment on what is normal and what is abnormal for their app. In reality, they are the only ones who have a chance of understanding it.

Regarding the hybrid approach, that's something useful, but only if we are feeding operational information back to Dev and QA in order to improve the initial testing of the app, and feeding forward results from QA, stress and performance testing that can be used to set initial alert thresholds - along with some indication of what known problems might actually look like, hopefully in the form of run documentation and/or dashboards.

The dashboard metaphor is key because no one has any time to figure out where the logs are for a given app. You need a mechanism to present details about the app, along with a list of who to call and what some of the resources involved might be.

Chapter 3 - Symantec Methodology - Process and Tools. Oops - I prefer People, Process and Product (tools). Must be some Borg over there at Symantec... And they claim the effectiveness of their "proven process" - but they haven't actually defined it yet... and then we are on to the products: Insight, Indepth and Inform. And that's all for that section.

Performance Management Stages include detect (symptoms), find (source), focus (root-cause), improve (follow steps to improve) and verify (verify). I call that Triage and Remediation. My first step is not detect but refute! I prefer a less adversarial approach by first intoning "how do we know we have a problem?" It sometimes makes for some uncomfortable moments of silence.

Think about it. Someone notifies you that there is an urgent issue. How do you know that they are accurate? What do you look at to confirm that the issue is actually an incident? This is an important bit of process because the stages of explanation can become something consistent, like a practiced drill that can be executed when stress is high and patience is short. When everybody is used to it, it is actually quicker to review what's right and what is apparently wrong. And you need a steady soul to initiate this and keep it on track - but that's what effective triage is really about - deliberate, conclusive steps until something is found out of place. Not as much fun as running around with your hair on fire but a whole lot more effective and predictable.

One of the big complaints I hear from operations is that alerts are dropping in all the time. It's not just too many alerts and their frequency. It often means too many alerts for which nothing could be found indicating a problem. False positives. Not actionable. Ultimately these are due to defects in visibility - and something that no amount of pressure and screaming can resolve. And there is an important point skipped over - the nature of alerts. Most alerts say that a system is unavailable - it has gone down or is unreachable. The supposed "proactive" alerts are different because they have thresholds defined. It's not unfair to suggest that alerting on a threshold gives more visibility than a "gone-down" alert - but it certainly isn't "proactive" - it's just using a threshold to define the alert. Duh! But how do you arrive at the threshold? What metrics do you select?
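The honest answer, I would argue, is that the threshold has to come from a baseline of normal behavior rather than from a guess. A minimal sketch in Python, with hypothetical metric values - not any vendor's implementation:

    import statistics

    # Hypothetical baseline: response-time samples (seconds) for one metric,
    # collected while the application was known to be behaving normally.
    baseline = [1.1, 1.3, 0.9, 1.2, 1.4, 1.0, 1.2, 1.3, 1.1, 1.5]

    def threshold_from_baseline(samples, k=3.0):
        """A simple statistical threshold: mean plus k standard deviations.
        Anything above it is 'abnormal' relative to observed history."""
        return statistics.mean(samples) + k * statistics.pstdev(samples)

    limit = threshold_from_baseline(baseline)

    def check(current_value):
        # A threshold alert can fire before the app is 'down' - earlier warning
        # than an availability alert, but only as good as the baseline behind it.
        if current_value > limit:
            return "ALERT: %.2fs exceeds baseline threshold of %.2fs" % (current_value, limit)
        return "OK: %.2fs is within baseline threshold of %.2fs" % (current_value, limit)

    print(check(1.2))
    print(check(4.8))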

As the writer revisits the different management styles, they point out that the mechanisms to realize the process improvements are part of the product (tools). Well, that's cool, but what I really like to focus on are the product-neutral processes - the things I should be doing no matter whose tool set is in play. Sure, it's nice that you have an automated mechanism to periodically do performance reviews. But what do you look at? What metrics are important? What changes are significant? How does this relate to what the operations team needs in order to be more effective?

Process is something that is easy to wave around. Automated processes sound even better. What the process is and how it relates to the current organization and capabilities - I don't think the Symantec tool has any concept of that. And that is the gap that limits adoption and, ultimately, the utility of the tool.

In summary, the Symantec methodology is detect, find, focus, improve, and verify. The different products (tools) implement and automate these processes. I guess that could be anything.

Section 2 - devotes a chapter (chapters 4-8) to each step of the methodology. Performance Reviews are the mechanism: the periodic health check. This requires (4) types of reports: Top-N (response times), Trends and Exceptions. Apparently, there are only (3)! I guess the Top-N invocations report is implied. The trends are just a historical view of a metric - no magic here! The exceptions are a historical view of exceptions and errors - also not magical. There should also be a status or overview of the environment - again, no magic needed.
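To underline how little magic is involved, here is a minimal sketch of the Top-N response-time view and the exception summary (Python; the metric records are hypothetical, not from the i3 product):

    from collections import Counter

    # Hypothetical 24-hour metric records: (component, avg_response_ms, invocations, errors)
    records = [
        ("ServletA",   5200, 1200,  3),
        ("ServletB",   1900, 8400,  0),
        ("OrderQuery",  640, 2100, 12),
        ("LoginCheck",  150, 9900,  1),
    ]

    def top_n_by_response(records, n=3):
        """Top-N report: slowest components by average response time."""
        return sorted(records, key=lambda r: r[1], reverse=True)[:n]

    def exception_summary(records):
        """Exceptions report: historical error counts per component."""
        return Counter({name: errors for name, _, _, errors in records if errors})

    for name, resp, inv, _ in top_n_by_response(records):
        print("%-10s %6d ms  (%d invocations)" % (name, resp, inv))
    print("Errors:", dict(exception_summary(records)))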

As the author moves back to Reactive Management, they introduce the concept of resource monitoring - databases, messaging, etc. These need specialized views (or tools) - no surprise there - but can't we be "proactive" for resource management as well?

Monday, November 30, 2009

The Essence of Triage

[clearing out the drafts folder - Nov 30]

When interpreting performance data for an incident, the question often arises as to what we should look at first. For my APM practice I always focus on "what changed", and this is easily assessed by comparing with the performance baseline or signature for the affected application.

But for folks new to APM, and often very much focused on jumping in to look at individual metrics, you can easily get confused because so many metrics will look suspicious. There are some attributes of the application - response times, invocation rates, garbage collection, CPU - that will be out of normal. And folks will bias their recommendations as to which avenue to explore based on the experience they have with a particular set of metrics.

My approach to this is pretty basic: go for the "loudest" characteristic first. Much like "the squeaky wheel gets the oil", the "loudest" metric is where you should start your investigation, moving downhill. More importantly, you need to stop looking once you have found a defect or deviation from your expectations, get it fixed, and then look again for the "loudest" metric.

This is important because the application will re-balance after the first problem is addressed, and you will get a new hierarchy of "loud" metrics to consider.
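Here is a minimal sketch of that ranking, in Python, with hypothetical metric names and baseline values (the point is the comparison against 'normal', not any particular tool):

    # Hypothetical baseline (normal) and current values for a handful of metrics.
    baseline = {"ServletA_resp_s": 4.5, "ServletB_resp_s": 0.5,
                "GC_pct": 3.0, "CPU_pct": 40.0}
    current  = {"ServletA_resp_s": 5.0, "ServletB_resp_s": 2.0,
                "GC_pct": 4.0, "CPU_pct": 55.0}

    def loudest(current, baseline):
        """Rank metrics by relative deviation from their baseline.
        The raw number doesn't matter; the change from 'normal' does."""
        scores = {name: (current[name] - baseline[name]) / baseline[name]
                  for name in current if name in baseline}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    for name, deviation in loudest(current, baseline):
        print("%-16s %+.0f%% vs baseline" % (name, deviation * 100))
    # Fix the top item, let the application re-balance, then re-run the ranking.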

For example, let's consider a real-world scenario where there are two servlets, each accessing a separate database. Use Case A accesses Servlet A, which queries an RDBMS with a response time of 4 seconds; Servlet A itself has a response time of 5 seconds. Use Case B has a response time of 2 seconds, accessing Servlet B, which makes a query via messaging middleware to a mainframe HFS that takes 1 second. Which of these is the loudest problem?

If you feel that a servlet response time of 5 seconds is a pretty good clue, you would be wrong. Sure, everyone should know that a servlet response time should be on the order of 1-3 seconds. And certainly being able to compare this performance to an established baseline would confirm it.

Instead, we will limit our consideration to the use case that actually has users complaining, which in this case is Use Case B.

"Wait a minute! You didn't tell us which use case had users complaining!"

Right. And neither will your users (real-world scenario)! What I'm driving at is that you can't know where to look until you know what has changed. And you can't know what has changed unless you have a normal baseline with which to compare. It's always nice when you have an alert or user complaint to help point you in the right direction but that can be unreliable as well.

For all this I prefer what I call the "Dr. House" model. Dr. House is a TV show in which House draws out the root cause for the troublesome medical cases he and his team take on. I think it's a great model for triage and diagnosis of application performance. One of the axioms of House's interactions with patients is that "everybody lies" when they present their medical history.

This is how I approach triage - everybody lies (or, in corporate-neutral language, inadvertently withholds key information). So I base my conclusions as to how to proceed and what to look for on what I can expose by comparing the current activity to 'normal' - or whatever baseline I can develop.

Wednesday, November 25, 2009

Yikes - project crashing!

My executive sponsor has bailed on the project. No details yet as to why. I am bummed. Sure, just need to sign up a fresh one but time marches on...

Tuesday, November 24, 2009

What You Should Know... part 2

Well, this part was less annoying.
  • Application lifecycle - includes development. Tru-dat my brother! Too bad the author's example is of a code profiler. And yes, most APM-savvy folks do not include development as part of APMonitoring. But if you really want to improve app performance (and the end-user experience), you need cooperation from development.
  • The SOA flag - This is not so bad. He inserts that ASM acronym again. But otherwise, this is accurate and helpful. You know, if you really think that APM is overloaded (which it is) - how about SPM - Service Performance Management. Everybody knows "ASM" means an assembler directive anyway!
And then a few other points - and the rest is an ad for a Microsoft product. And now all the spin makes sense - a desktop view of the world's IT management problem. Not so innovative. Read at your own risk.

All in all, what can I say about this author? He has 30 books published. I have a book proposal. So I am crap.

But. He writes about APM. I "do" APM. I help clients realize APM. I "talk the talk" AND "walk the walk".

The author is "two-thousand and LATE"! ;-)

Now, if I only had a book...

What You Should Know About Application Performance Management

This is from RealTime Nexus - The Digital Library for IT Professionals. Do your own search - I will not spoil my page with a link to it! Let me opine on the salient points:
  • Makes a distinction for APMonitoring (APM == Monitoring) and somehow reserves the assignment of thresholds to Health Monitoring, as different from Performance Monitoring, even suggesting "AHM" as a new acronym... but then realizing that APM is already well accepted. I think it would be more accurate to acknowledge that APM tools don't do "Health Monitoring" OOTB, but I would submit that APM processes address this - especially as I already use a "HealthCheck" process as part of the APManagement best practices.
  • Asserts that "end-user performance" is the primary metric for APM and acknowledges that "other metrics may be involved... to provide troubleshooting details." This is too much of a plug for a specific vendor solution. Sure, end-user experience is what performance management is all about, but it is a little naive to assert that it is the main focus. There is a huge benefit in using APM across the application lifecycle, and especially before the end-user experience can even be measured (development, testing). Not to mention the significant number of applications that do not even have a user front-end to measure!
To preserve vendor neutrality, I instead prefer to focus on the "business transaction" - the end-2-end flow across all participating components. This puts the end-user piece in the proper perspective: simple apps are predominately end-user-centric; enterprise apps... are not.

In assessing the utility of an APM initiative, the focus is always on the high-value transactions - end-user-related or not. Then, when you know what really matters, you select the appropriate technology. That way, you don't end up trying to use end-user monitoring on a CICS transaction, nor byte-code instrumentation to monitor a print server. ;-)

  • How does APM work -- I was nestling in for a good read here - but was disappointed. Regarding the types of information that an APM tool can take advantage of, the author describes the following:

"Application-level resource monitors, if any exist. These have to be specifically created and exposed by the application author."

I guess they never heard of BCI (Byte Code Instrumentation) - which does this automatically. And they have impugned JMX and PMI technologies - which do the right job for configuration information of the application server, which is what I'm hoping the author really meant. JMX and PMI require the developer to code for their use. Always was, always will be - and an expensive proposition at development time. But BCI automatically determines what is interesting to monitor, much more effectively than JMX or PMI - and at runtime (aka late-binding). And if the data is already there, we take direct advantage of it.
  • Downsides of APM -- This is annoying because it is a grain of truth buried in a FoxNews-spun positioning. Sure, packaged apps are hard to monitor, but this is because they are closed and usually provide their own monitoring tools. They may not be best of breed - but it's a start. Some packaged-app vendors will actually embed APM technologies within their offerings, and some require a vendor-specific extension - for sure, the industry is lagging a bit here, and that's always a problem with proprietary solutions. It is not an APM problem.
Then the author raises some FUD about virtualized environments... Does he know that an LPAR is a thirty-year-old virtualized environment? That running in a container, and making accurate performance measurements there, is old news? Methinks the vendor he is shilling for cares not for mainframe, or BCI, or managing the lifecycle. Or he is some sort of APM Luddite who hasn't looked around much since 2002.
  • Application Service Management (ASM) -- this is the point that set me off and motivated me to write this detailed review. The author creates a new acronym - cool. I do it all the time. No crime here. But "ASM takes a more business-oriented view." - yikes! ASM focuses on the operating system and platform metrics... I thought that's NOT the business view. And then the author acknowledges that the APM/ASM differences are "semantic" - and that you should never get caught up in that. Frankly, this is the same tack that creationists take when they assert that "... evolution is a theory that is still being considered...". Dude!!!
And now a link to two more parts... I am a sucker for pain! ;-)

Wednesday, November 18, 2009

Legal approved the project, with caveats

With little fanfare the first major hurdle has been surpassed. The caveats were as expected: no discussion of product technology, no vendor or client names, and no vendor bias - what we call vendor-neutral. The last point has been the real 'chestnut' for the marketing folks with past projects. They want to control the product message and spin things in their direction. I've always been of the mind that the only real solution was to stay vendor and technology neutral. I've been doing it this way for years.

I've been sticking to "vendor-neutral" as the dominant theme for the APM practice over the last (4) years, and it has clearly been the "path least traveled". When there is nothing to leverage, the marketing arm has generally been indifferent to the program, but sometimes it would be a little more pointed. The name of the program, for instance, went through a number of iterations, each about (9) months apart, each ending with a pronouncement and re-branding of all the materials - and then silence. And when we did get some marketing support for a brochure or web link, you really had to hunt it down.

Last year I had an MBO to develop sales training materials, get them recorded, and up on our internal training site. Everything was under "severe urgency", and when it was finished, no notice of the update was allowed and no edit to make it part of the presales learning plan - which is really the only way folks will spend a couple of hours doing training. Sure, it was a month late, but I had some family problems. So here it is a full year later and folks are still surprised that they can find it.

So, just three days after the blessing by legal, I'm now finding that the project PM is questioning the very premise of re-purposing the practice's root documents into book form, and that call will not happen until December 4th. This will suck!

Call me paranoid but I think the old product-centric nemesis is back again! - preserve the status quo, it's not our business anyway, it will only help the competition survive another year...

I did a best practice paper a couple of years back focusing on memory management, going well beyond the product doc to really show folks how to use it correctly and, more importantly, how to back out of problems when it was used incorrectly. After a year as a proposal - and a cherry of a project (from a technical perspective) - I picked up the project (the original lead had backed off) and finished in 3 months. Multiple reviewers, and probably my best work at the time. Then it went to marketing for approval to publish and sat there for (9) months. Then it was released, with no edits or commentary at all. My inside sources said the issue was that marketing could not accept that someone could misconfigure the product. Sounds plausible - but folks blow stuff up all the time, with any software system. It seems cruel that you would not show them how to back out gracefully - but that's just my opinion. Anyway, I never sent any more material up to marketing. And now that decision has come back to bite me - after 4 years of constant writing without any marketing oversight, will they actually review it? Shoot the project? Or throw up their hands and yield to the marketplace?

My original response to management, back in August when the book was 'commissioned', was that marketing would wake up and block a book - so why not just drop the fame and glory and go with a Google Knol? We publish what we want (with blessing by legal), the world is made a better place - and I can go on to the next big thing. I still think that is the best route, especially as my optimism wanes.

Maybe it will be better after some turkey?

Wednesday, October 28, 2009

Compuware - The Definitive Guide to APM

This is a web book from RealTime Publishers (nexus.realtimepublishers.com). Currently, only half of the chapters are delivered, so here is what it is!

The Good: Emphasis on the process gap and organization maturity as the real barriers to APM success. It includes small vignettes of IT life at the start of each chapter, to highlight the IT situation and challenges. They even define the "M" of APM as Management, not Monitoring. They avoid the word "dashboards" in favor of visualizations (cool: brings reports back to the table) and mention "application lifecycle" a few times, as in "APM optimizes the application lifecycle". And, my favorite: "measuring performance is a proxy for understanding business performance".

The Bad: It purports to "take you through an entire implementation" but doesn't offer any depth. It is more of an extended whitepaper. It does cover the lifecycle of the motivation, decisions and implementation of an APM solution, but only as a conversation about what could be done. They only acknowledge stakeholders as SysAdmins, Developers and End-users (the guy using the browser).

The revelation for me was that they dredged up a Gartner Maturity Model from 2003 that had some interesting contrasts with the model we derived from our internal analysis of implementation failures and successes. Gartner identified management maturity as "Chaotic, Reactive, Proactive, Service and Value". Our model allowed for "Reactive, Directed, Proactive, Service-Driven and Value-Driven", with Reactive further divided as Reactive-Negotiated and Reactive-Alerting. I don't get too much access to Gartner stuff but here is how I represented Management Maturity:

This is from the first ICMM positioning around 2005. I don't recall why I decreased the size of each box, from left to right. I think I was trying to emphasize efficiency or proportion of IT organizations that might be found practicing at that level.

"Directed" is using APM metrics to influence the application lifecycle, focusing on QA practices.

"Service-driven" - everybody has this goal, we only tried to put teeth into it by associating it the definition of best practices.

I really liked this slide but it is not the emphasis we have today. We focus more on "visibility" because it provides a more "joining" context among the various tools that are available and their contributions, rather than a focus on a particular technology that excludes all others. We also highlight the impact that visibility has on the existing processes in an organization and how this helps us assess their maturity and make meaningful recommendations for remediation.

Management Maturity and the processes that go along with it are the foundation of the Compuware APM message. That's not so bad. But then they fall into some unusual partitioning of APM in order to highlight their transaction monitoring technology. So I conclude that while they are saying the right things, they regrettably recast APM as something that their technology delivers - and you only need to look at transactions.

Process re-engineering is actually pretty difficult, and for well-established, reasonably successful Reactive-Management organizations, this message is just way too hollow. These prospects know a lot about processes and may even know where they have gaps. What they need instead is a plan to help them evolve the organization, consistent with APM. What can they do now, and tomorrow (and without purchasing new technology), that will get them on track to APM? What will derail their efforts? How will they know they have improved the situation? When will investment help accelerate their drive to APM?

Like Neo (Matrix) - Stuck on the subway platform

Still waiting on the proposal acceptance. Editing assistance contract, version two, is off to legal. Some business challenges and a bit of reorganization have pulled attention away from the project. Thankfully, the marketing folks tripped over an APM book project from Compuware so now they are beating the drum to help get the oars moving on this boat!

I've started outlining sections on Triage and Maturity Models. These are both pretty big topics and left out of the initial outline but it might be useful to have them ready in case the 'editing' decimates the current block of content.

Currently, I'm at 30k words or 60% of my initial target but really didn't make much progress in October. All the ducks are lined up, in terms of the process obligations. It feels like my flight home is cancelled and I'm waiting for a schedule update - no meeting at risk - just my own time!

Saturday, October 3, 2009

One Writer, Dozens of Editors

Not quite back from vacation, I foolishly moved my crackberry from stun to loud, and shortly thereafter received an invite to join an unscheduled (for me) conference call with the internal publishing group. I was a couple of minutes late to the call but quickly learned that there is a lump of process to follow, none of which I had been aware of earlier. Yikes!

I'm about 50% complete with the first draft - a status I share when asked. I never expected to hear "Stop right now! You can't start writing until your outline is approved!" She was serious! Apparently, they expect the process to be followed, and there was some concern that I would be wasting time writing now, when the outline may very well be changed.

I don't see this as much of a concern. Frankly, I would prefer to cut-n-paste from too much material rather than come up short and have to write new material from scratch. The whole purpose of this project was to re-purpose the work I had already done, in presentation and white paper form, and bring it all together into a single, cohesive book experience. So I'm wondering what the editors will come up with to strengthen or reduce the outline. So far, some internal reviewers of the outline have simply determined that "It looks good. Get-er done!". My simple goal is to not be the bottleneck in this part of the process - the writing. Whatever it takes to get it into book form - I don't have any control over that. But metering the pace of writing - that doesn't seem very sportsmanlike.

But it was concluded that I needed to prepare and submit a book proposal, which would include a detailed outline, resume, and a whole lot of other stuff that the marketing folks will need when this project goes to press. And it needed to be done ASAP. So I got it together, updated my resume (which I hadn't done for over six years) and pushed out the goods Monday night.

Silence.

So far.

On the contractor front, that seems to be going well. I've chatted with my soon-to-be-anointed collaborator/editor/I've_done_books_before dude. We have the first draft of their services proposal. I have it in with our internal team for review and comment. The 'hot potatoes' are off in other hands. And I keep chugging along.

Part of the book proposal application was a couple of questions along the due diligence theme. Has someone else covered this topic? What publications have a similar theme? What makes your book different?

I took a moment to pause and reflect. This could be a useful bit of exercise. I've always popped off a Google search, from time to time, to see if anyone has done something significant for APM, and never came back with much. But then I was a little more biased towards performance of distributed architectures, and not simply Application Performance Management.

After a concentrated effort, I started finding a few leads. I had a moment of panic - "Crap! Has someone beat me to the punch?" But on further review, there were a couple of good ideas, though not at all the message I'm planning to drop on the industry. So what am I finding? I'll lay that out over the next couple of blog entries.

Tuesday, September 22, 2009

Funding is confirmed

Off on vacation, I received an email that funding was to be approved. I'll get the details when I get back to work next week. I immediately sent off a notice to the editor candidate, as we still have to do a final interview. I think everything will be fine, but you just can't know until you share the plan, goals and concerns.

Saw some corporate messaging around PCMM - the People Capability Maturity Model - as part of an HR initiative. Maybe influenced by the earlier ICMM - Instrumentation Capability and Maturity Model... but that might be too bold an assertion. Still nice to see some parallel efforts to document the skills, processes and competencies appropriate for a body of experience.

Before vacation, I got pulled into a firefight over a stability/scalability issue. No QA testing was possible, so we went into production with a default configuration. With the holiday weekend, we got 5 days of data before the issue was reproduced. This is an excellent scenario, as the application team was unaware that monitoring had been deployed so far in advance, so they immediately suggested that "probably the monitoring was the culprit...". The bridge call got quiet quickly when they learned that we had been running without incident for 5 days already. Then the monitoring team dug right into the heart of the matter - the load had tripled. Was this something that was tested? More silence on the call. Simply put, the app cannot scale to the load that was delivered.

We also had a CritSit team from another major vendor who were proposing getting some instrumentation in place, checking logs, etc. - they would need about a week. While that monologue unfolded, the monitoring team generated a report with the bulk of that info and circulated it via email. The CritSit team scurried to open up that pdf, we reviewed it a bit, and then they realized that we already had everything they were proposing. "So can we all agree to start looking for the bottleneck?" For me, that means it's time to come off site. The job is done and vacation awaits!

At the onset of this engagement, there was considerable pressure to deploy all kinds of functionality. Sales is anxious to sell something - which is always the case - and wants to use the firefight to showcase the product. But this is extraordinarily risky without some time (a few hours) to test those configurations in the client environment. So you go with the smallest configuration that offers reasonable visibility - I'm only interested in getting visibility into the nature of the performance problem. In this case, we could only manage some manual exercise of the application while the testing environment itself was having problems. We got about 500 metrics, which is much lower than expected for this kind of application but enough to verify that major entry points were being seen and transaction traces had good visibility. On initial deployment to production, this increased to 2500 metrics. But this was still too low, as the bulk of usage had already transpired for the operational day. After 24 hours we had increased to about 5000 metrics, and this remained consistent over the next few days.

When the problem is not easily reproduced, and QA is not useful, you need to focus on helping the monitoring team understand what 'normal' is for the application. When I first look at an application, I have a lot of experience to bring to bear from similar apps. So I have an idea of what is correct and what is problematic. This becomes more accurate as operational visibility accrues. I'll build the initial report based on my findings from the QA exercise, but I'll shortly have the monitoring team do the mechanics of generation. For the first day, we will have them transfer the metrics repository, so I can (working remotely) generate the report AND confirm that we are getting the major points of interest. I then give them the report template (part of a management module - a jar file) and have them generate the report. Transferring the metrics archive (via ftp) can take 30-60 minutes and a little work. Generating the report takes a couple of minutes, and it can be sent via email. As we transfer responsibility, the workload actually decreases, so pushback is never a problem here.

We will go through the report each day, illustrating how to present the information in a neutral fashion, indicating what appears normal or abnormal. After 2-3 sessions, the monitoring team will begin to review the report on the bridge call. I'm just there for air cover. The goal is to get everyone looking at the same information (the report) and understanding 'normal'. When the event is encountered again, the practice of the report review directly contrasts the 'incident' with 'normal'.

Very often the bridge call is little more than a shouting match. Participants have access to their own data and information about the nature of the incident and "orally" share this on the bridge. There is a lot of subjective reasoning that goes on, which can be a distraction and is ultimately counterproductive. Everybody wants to add value, but the conclusions are not always consistent. Being able to quickly share the baseline of application performance simply puts everyone on the same page. It also addresses a major gap with bridge calls - no access to historical information. Simply having a 24-hour report, every day, is enough to change this shortcoming. Initially you will have to spend a bit of time educating the team on how to interpret the metrics and explain some language and terms - but most teams come up to speed quickly.

Of course you might think that getting everyone access to the workstation displays of performance information would be even better. However, most organizations are simply not yet sharing information, even a simple baseline performance report. You have to establish this process and skill level first before you introduce the richness of the workstation display. Otherwise, the team will get distracted, confused, or frustrated - and they will never even get to the baseline information.

Tuesday, September 1, 2009

Awaiting Funding

The last month was a bit tedious but I was earlier invited to propose a book outline that covers all of my exploits and learning while assisting clients in their realization of application performance management. This blog will cover those developments leading to the publication of the book and the feedback on the book once it gets released to the public at large. I am optimistic that this will happen, even as a small issue of funding remains in limbo.

I have been at this crossroads a few times before. While IT organizations, in general, would benefit from a thorough discussion of this topic, product management is not fully on board. A good part of this is because the techniques are largely vendor-neutral. This is important because the marketplace has matured and only a few players remain. Of those players, only one is in growth mode and continuing to invest. It is believed, by some, that releasing these best practices and this understanding of the marketplace will give the ailing competitors a second chance. My view is that by educating our potential customers and helping them utilize their existing investments (even with competing tools), we will increase their pace of adoption and they will naturally choose to partner with the stronger player. Fortunately, sales already knows that this approach will accelerate adoption, and thus license revenue. Getting these techniques into book form simply scales the number of interactions we can support. Who will win out?

The book will be based on my library of presentations and discussion papers which, to date, has been limited to client engagements and internal training. Public notice of this body of work is limited to a single event, in November 2007, where a reporter wrote an unauthorized story about a presentation by one of my clients in which they discussed how they had established a Performance Center of Excellence and were practicing true, proactive management of their application lifecycle. That's a lot of buzzwords, but given that it was done largely with their own staff, and had already demonstrated value, it came as a bit of a shock compared to the much longer periods that such IT initiatives need in order to show some success. There were some services, of course. I led that initiative and a number of other similar programs, in what we call the "Mentoring Model". I've seen things...

Anyway, this single event launched a number of sales initiatives, based on the concept of quickly building client teams and following our APM Best Practices, to dramatically increase the pace of adoption of APM tools and the value realized when employing those tools. This resulted in a remarkable quantity of deals, each in excess of $10M in new product revenue. None of these deals had services attached to them, which is both annoying (I'm in professional services) and illuminating. We will still do some services with these clients, but it will be of very high value and limited duration. Not at all like the large staffing and long duration of traditional IT initiatives.

What is this APM marketplace? Why does it warrant such investment? What kind of value does it return? How do you get started? What are these APM best practices? How do you build your own team? These are all topics which the book will present and discuss. It is a topic that I believe the world needs to know about. But I am largely constrained by the proprietary and confidential nature of my work - so my employer's commissioning of this book is an essential piece of the puzzle.

Historically, in my career in the commercial software business, I have been at this precipice three times already, in having a large body of internal-only materials, and a marketplace starving for guidance, direction and illumination. Is this fourth time the charm? One of my mentors had suggested, a few years back, that no one can fail when they listen to the customer and find a way to give them what they need. I've got it all together this time. I've kept the faith. But I really need to see it happen... once!

Now if I can only get the funding to bring it to light.