Friday, March 19, 2010

Contract signed - now get back to writing!

Well, after many months of little progress the proposal has become a project and we have a signed contract with Apress to produce the book. I think we can get the whole project wrapped for the fall - maybe September. I still need to finalize the editorial schedule, and sign up some more internal reviewers.

Wednesday, December 16, 2009

Oblicore - Service Level Management

Notes from the public (registration required) online seminar, 16 December 2009, 11:00 AM EST:

Deming (father of quality control) - Plan, Do, Check (measure), Act

SLA can mean many things - ITIL, ERP, CRM, COBiT - folks think they are talking about the same things but meaning varies, depending on where they are in the organization.

"When you have no destination - Every road leads nowhere". You need to align business needs with targets.

- An SLA is a contract between a provider and a customer. Service Level Agreement, Operational Level Agreement, Underpinning Contract. They all document targets and specify the responsibilities of the parties.

- SLM -- ITIL Service Level Management. Tells you what to do (but not necessarily how to do it)
- Define, document, agree, monitor, measure, report and review the level of IT services provided.

Most companies manage SLAs manually. Each contract negotiated separately. Labor intensive data collection. Reporting is reactive.

- Establish - Implement - Manage - Review:

Service Improvement Plan: Define Strategy, Planning, Implementation, Catalog Services, Draft, Negotiate, Review SLAs SCs and OLAs, Agree, measure and monitor, Report, Review SLM process, Review SLAs SCs and OLAs (back to start of loop).

( lots about roles for the above steps...)

- Standardizing Contract Creation and Revision - Data collection and metrics need to be uniform.
-- most SLAs require multiple data sources. Need to aggregate the data, apply exceptions (TOD, external events), correlations with other incidents.
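A minimal sketch of what "aggregate the data, apply exceptions" could look like, assuming availability samples from more than one feed and a list of agreed exception windows. The class names, feed names and sample data are all hypothetical; nothing here reflects how Oblicore actually does it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sample:
    timestamp: datetime
    source: str       # which monitoring feed produced the sample (hypothetical)
    available: bool


@dataclass
class ExceptionWindow:
    start: datetime
    end: datetime
    reason: str       # e.g. a time-of-day maintenance window or an external event


def in_exception(ts, windows):
    """True when the timestamp falls inside any agreed exception window."""
    return any(w.start <= ts < w.end for w in windows)


def availability(samples, windows):
    """Fraction of samples that were available, ignoring excepted periods."""
    counted = [s for s in samples if not in_exception(s.timestamp, windows)]
    if not counted:
        return None
    return sum(s.available for s in counted) / len(counted)


# Made-up data: hourly samples from two feeds, outages at hours 2, 3 and 10.
t0 = datetime(2009, 12, 16, 0, 0)
samples = [
    Sample(t0 + timedelta(hours=h), "ping" if h % 2 else "synthetic", h not in (2, 3, 10))
    for h in range(24)
]
windows = [
    ExceptionWindow(t0 + timedelta(hours=2), t0 + timedelta(hours=4), "maintenance window")
]

raw = availability(samples, [])           # no exceptions applied
adjusted = availability(samples, windows) # maintenance window excluded
print(f"raw={raw:.3f} adjusted={adjusted:.3f}")
```

The point of the two numbers is that the same raw data gives a different SLA result once the contract's exceptions are applied, which is why the collection and the exception rules need to be uniform across contracts.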

(more on types of pain when a robust process for SLA management is absent)

Oblicore does the last mile to drive an automated ITIL approach, via a double closed-loop process. (Geeze - this is buzzword heavy!).

(demo of Oblicore Guarantee) UI is browser-based, multiple tabs. Basically workflow management with a federated view of multiple metrics, over various web-based forms, following the ITIL model. Generates paper! But somebody has to sign this stuff - so, way better than Excel... They use adapters to bring in various metrics. Looks like an ETL transform (Table/Field assignment). Don't know how real-time this could be.

Friday, December 4, 2009

Storm Clouds are Clearing

Well the word is that we are refitted with an appropriate executive sponsor and a status call is scheduled for Monday 7 December. I'm still expecting a few bumps in the road.

Review - Symantec I3 - A Performance Management Methodology

[Clearing out the Drafts folder (Dec 4)]
http://silos-connect.com/solutions/i3methodologybook.pdf

This has a 2005 copyright - so I'm not expecting to get anything too useful here. But it does show up when you are searching on "APM Methodologies".

Introduction - Poor app performance translates into poor end-user experience.

Chapter 1 - Defines incidents as "performance red zones" - where the system fails to meet performance goals, and suggests that these may also be used to define SLAs. They break down incidents into four major types, and I've added more conventional definitions of what they had in mind:
  1. Design and Deployment Problems (Design, Code, Build, Test, Package, Deploy)
  2. System Change Problems (Configuration mistakes due to tuning or attempted fix)
  3. Environment Change Problems (unexpected load or usage patterns)
  4. Usage Problems (Prioritized Access to Shared IT Resources)
The summary of that section presents that:
"Performance management comprises three core activities, reactive, proactive and preventative. Symantec i3 provides the techniques and the tools that are necessary to carry out these activities using a structured, methodical, and holistic approach."
As much as the core activities are accurate, there is nothing in chapter 1 that supports that summary.

Chapter 2 - provides the substance for the core activities described in the summary from chapter 1.
  • Reactive Management - maximize alertness and minimize time-to-detection of problems, equip staff with the right tools to minimize time-to-resolution. I guess lots of caffeine is one of the tools!
  • Proactive Management - "close monitoring" will often detect that problems are likely to occur. I guess you need to be in the same room as the monitoring...
  • Preventative Management - Minimize performance problems in the first place by employing periodic health checks and resolve problems before they get out of hand.
  • Hybrid Approach - Combine all (3) management styles, with feed-forward and feed-back, leading to an overall decline in firefighting situations.
I don't mind where they ended up (hybrid approach) but it seems very operations-centric, which is unusual because the tool itself is developer-focused - and I've never met anyone who could use i3 in operations. But phrases like "maximize alertness" and "close monitoring" suggest that the contract writer didn't know that much more about monitoring than the marketing dude who approved this whitepaper.

Reactive Management IS jumping to action when an incident occurs. But the call to action comes from an alert and the problem with alerts is that they very often don't bubble up to a Trouble Management console until some 15-30 minutes after the incident started. So somehow being closer to the keyboard doesn't really add any value - you are still the last to know. When your helpdesk is a more timely indicator of application health, than your monitoring solution - this is when you know you have a major visibility deficit.

Likewise, defining proactive as more frequent monitoring, in order to close that 15-30 minute gap, is also a disservice. The operations team is simply not in a position to be constantly reviewing the key metrics for hundreds of applications under their watch. They rely on alerts to filter out the application noise (all those squiggly traces) and indicate which app is having problems. It might surprise you to also realize that operations has exactly two responses to an alert: 1. Restart the app, or 2. open a bridge call, and restart the app. That's 95% of modern IT today and that's the real problem.

I define proactive management as preventing the problem from getting deployed in the first place - not simply responding to an alert more quickly. The writer assigns this instead to preventative management, so maybe it is only a semantic (funny, no?) difference. But if they mean the periodic healthchecks as something that is occurring in operations, then this is a fantasy. Reverse-engineering the normal performance characteristics from an operational application is a massive task. Remember, we are not talking about the home web server - we are talking about a commercial, enterprise environment with hundreds of applications. That's really not the role for operations to undertake and decide. In order to make the problem tractable, the app owners have to pass judgment on what is normal, and what is abnormal for their app. In reality, they are the only ones who have a chance at understanding.

Regarding the hybrid approach, that's something useful but only if we are feeding operational information back to Dev and QA, in order to improve the initial testing of the app. And feeding forward results from QA, Stress and Performance testing, which can be used to set initial alert thresholds and maybe give some info as to what known problems might actually look like, hopefully in terms of run documentation and/or dashboards.
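One way the feed-forward idea could look in practice: take the response times measured during QA and derive a first-cut alert threshold from them. This is only a sketch. The "mean plus three standard deviations" rule is my assumption for illustration (the whitepaper does not prescribe one), and the function names and sample numbers are made up.

```python
import statistics


def initial_threshold(qa_response_times, sigmas=3.0):
    """First-cut alert threshold: QA mean plus N sample standard deviations.

    The 3-sigma default is an illustrative assumption, not a recommendation
    from the i3 material.
    """
    mean = statistics.mean(qa_response_times)
    stdev = statistics.stdev(qa_response_times)
    return mean + sigmas * stdev


def breaches(observed_seconds, threshold):
    """True when a production measurement exceeds the fed-forward threshold."""
    return observed_seconds > threshold


# Made-up response times (seconds) from a QA performance run.
qa_samples = [1.0, 1.2, 0.8, 1.1, 0.9]
threshold = initial_threshold(qa_samples)

print(f"initial threshold: {threshold:.2f}s")
print("2.0s breaches:", breaches(2.0, threshold))
print("1.2s breaches:", breaches(1.2, threshold))
```

The threshold is only a starting point. The feedback half of the loop is operations reporting back which alerts were actionable, so the next QA cycle can tighten or loosen it.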

The dashboard metaphor is key because no one has any time to figure out where the logs are for a given app. You need a mechanism to present details about the app, along with a list of who to call and what some of the resources involved might be.

Chapter 3 - Symantec Methodology - Process and Tools. Oops - I prefer People, Process and Product (tools). Must be some Borg over there at Symantec... And they claim the effectiveness of their "proven process" - but they haven't actually defined it yet... and then we are on to the products Insight, Indepth and Inform. And that's all for that section.

Performance Management Stages include detect (symptoms), find (source), focus (root-cause), improve (follow steps to improve) and verify (verify). I call that Triage and Remediation. My first step is not detect but refute! I prefer a less adversarial approach by first intoning "how do we know we have a problem?" It sometimes makes for some uncomfortable moments of silence.

Think about it. Someone notifies you that there is an urgent issue. How do you know that they are accurate? What do you look at to confirm that the issue is actually an incident? This is an important bit of process because the stages of explanation can become something consistent, like a practiced drill that can be executed when stress is high and patience is short. When everybody is used to it, it is actually quicker to review what's right and what is apparently wrong. And you need a steady soul to initiate this and keep it on track - but that's what effective triage is really about - deliberate, conclusive steps until something is found out of place. Not as much fun as running around with your hair on fire but a whole lot more effective and predictable.

One of the big complaints I hear from operations is that alerts are dropping in all the time. It's not just too many alerts and their frequency. It often means too many alerts for which nothing could be found indicating a problem. False-positives. Not actionable. Ultimately these are due to defects in visibility - and something that no amount of pressure and screaming can resolve. And there is an important point skipped over - the nature of alerts. Most alerts are that a system is unavailable - it has gone down or is unreachable. The supposed "pro-active" alerts - these are different because they have thresholds defined. It's not unfair to suggest that alerting on a threshold is more visibility than a "gone-down" alert - but it certainly isn't "proactive" - it's using a threshold to define the alert. Duh! But how do you arrive at the threshold? What metrics do you select?

As the writer revisits the different management styles, they point out that the mechanisms to realize the process improvements are part of the product (tools). Well, that's cool but what I really like to focus on are the product-neutral processes - the things I should be doing no matter whose tool set is in play. Sure, it's nice that you have an automated mechanism to periodically do performance reviews. What do you look at? What metrics are important? What changes are significant? How does this relate to what the operations team needs in order to be more effective?

Process is something that is easy to wave around. Automated processes sound even better. What the process is and how it relates to the current organization and capabilities - I don't think the Symantec tool has any concept of that. And that is the gap that limits adoption and, ultimately, the utility of the tool.

In summary, the Symantec methodology is detect, find, focus, improve, and verify. The different products (tools) implement and automate these processes. I guess that could be anything.

Section 2 - devotes a chapter (chapters 4-8) to each step of the methodology. Performance Reviews are the mechanism: the periodic health check. This requires (4) types of reports: Top-N (response times), Trends and Exceptions. Apparently, there are only (3)! I guess the invocations Top-N report is implied. The trends are just a historical view of a metric - no magic here! The exceptions are a historical view of exceptions and errors - also not magical. There should be a status or overview of the environment - again, no magic needed.
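To back up the "no magic" claim, here is a rough sketch of the two Top-N reports (response times and the implied invocations one) built from plain measurement records. The record layout, component names and numbers are hypothetical; this is not how i3 produces its reports.

```python
from collections import Counter, defaultdict


def top_n_by_response_time(records, n=3):
    """Rank components by average response time, slowest first."""
    totals = defaultdict(lambda: [0.0, 0])  # component -> [sum of seconds, count]
    for component, seconds in records:
        totals[component][0] += seconds
        totals[component][1] += 1
    averages = {c: total / count for c, (total, count) in totals.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:n]


def top_n_by_invocations(records, n=3):
    """Rank components by how often they were invoked, busiest first."""
    return Counter(component for component, _ in records).most_common(n)


# Made-up (component, response time in seconds) measurements.
records = [
    ("ServletA", 5.0), ("ServletA", 4.0),
    ("ServletB", 2.0),
    ("LoginServlet", 0.5), ("LoginServlet", 0.7), ("LoginServlet", 0.6),
]

print(top_n_by_response_time(records, n=2))
print(top_n_by_invocations(records, n=1))
```

A trend report is the same data bucketed by time, and an exceptions report is the same counting applied to errors. The hard part is not producing the reports, it is knowing what to look for in them.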

As the author moves back to Reactive Management, they introduce the concept of resource monitoring - databases, messaging, etc. These need specialized views (or tools), and no surprise here but can't we be "proactive" for resource management as well?

Monday, November 30, 2009

The Essence of Triage

[clearing out the drafts folder - Nov 30]

When interpreting performance data for an incident the question often arises as to what should we look at first. For my APM practice I always focus on "what changed" and this is easily assessed by comparing with the performance baseline or signature for the affected application.

But for folks new to APM, who are often very much focused on jumping in to look at individual metrics, it is easy to get confused because so many metrics will look suspicious. There are some attributes of the application (response times, invocation rates, garbage collection, CPU) that will be out of normal. And folks will bias their recommendations as to which avenue to explore based on the experience they have with a particular set of metrics.

My approach to this is pretty basic: go for the "loudest" characteristic first. Much like "the squeaky wheel gets the oil" - the "loudest" metric is where you should start your investigation, moving downhill. More importantly, you need to stop looking once you have found a defect or deviation from your expectations, get it fixed, and then look again for the "loudest" metric.

This is important because the application will re-balance after the first problem is addressed, and you will get a new hierarchy of "loud" metrics to consider.
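A small sketch of "go for the loudest metric, fix it, then look again", assuming you already have a baseline value for each metric. Scoring loudness as relative deviation from baseline is my simplification for illustration, and the metric names and numbers are made up.

```python
def loudness(current, baseline):
    """Relative deviation of each metric from its baseline value."""
    return {
        name: abs(current[name] - baseline[name]) / baseline[name]
        for name in baseline
        if name in current and baseline[name] != 0
    }


def loudest_metric(current, baseline):
    """Name of the metric that deviates most from normal, or None."""
    scores = loudness(current, baseline)
    return max(scores, key=scores.get) if scores else None


# Hypothetical baseline ("normal") and the values seen during the incident.
baseline = {"response_time_s": 1.0, "invocations_per_min": 200.0,
            "gc_time_pct": 2.0, "cpu_pct": 40.0}
incident = {"response_time_s": 2.5, "invocations_per_min": 220.0,
            "gc_time_pct": 12.0, "cpu_pct": 70.0}

first = loudest_metric(incident, baseline)

# After the first defect is fixed the application re-balances,
# so measure again and pick a new "loudest" metric.
after_fix = {"response_time_s": 1.8, "invocations_per_min": 220.0,
             "gc_time_pct": 2.2, "cpu_pct": 45.0}
second = loudest_metric(after_fix, baseline)

print("start with:", first)
print("then look at:", second)
```

In this made-up case garbage collection is loudest at first. Once it is fixed, CPU settles down on its own and response time becomes the next thing to chase, which is the re-balancing effect.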

For example, let's assert a real-world scenario where there are two servlets, each of which is accessing a separate database. Use Case A accesses Servlet A, which is accessing an RDBMS and has a query response time of 4 seconds. Servlet A has a response time of 5 seconds. Use Case B has a response time of 2 seconds; it accesses Servlet B, which makes a query via messaging middleware to a mainframe HFS, which takes 1 second. Which of these is the loudest problem?

If you feel that a servlet response time of 5 seconds is a pretty good clue, you would be wrong. Sure, everyone should know that a servlet response time should be on the order of 1-3 seconds. And certainly being able to compare this performance to an established baseline would confirm it.

Instead, we will limit our consideration to the Use Case which actually has users complaining, which in this case is Use Case B.

"Wait a minute! You didn't tell us which use case had users complaining!"

Right. And neither will your users (real-world scenario)! What I'm driving at is that you can't know where to look until you know what has changed. And you can't know what has changed unless you have a normal baseline with which to compare. It's always nice when you have an alert or user complaint to help point you in the right direction but that can be unreliable as well.
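To make the point concrete with the scenario's numbers: the observed response times (5 seconds for Use Case A, 2 seconds for Use Case B) come from the example above, but the baseline values below are invented purely for illustration. With them, a minimal sketch shows how "slowest" and "most changed" point in different directions.

```python
# Observed response times (seconds) taken from the scenario.
observed = {"Use Case A": 5.0, "Use Case B": 2.0}

# Hypothetical baselines: assume A has always been slow and B used to be fast.
assumed_baseline = {"Use Case A": 4.8, "Use Case B": 0.5}

# Looking only at absolute numbers picks the slowest use case.
slowest = max(observed, key=observed.get)

# Comparing against the baseline picks the one that actually changed.
ratio = {uc: observed[uc] / assumed_baseline[uc] for uc in observed}
most_changed = max(ratio, key=ratio.get)

print("slowest:", slowest)
print("most changed vs. baseline:", most_changed)
```

Without the baseline there is no way to tell these two answers apart, and the absolute number sends you after the wrong use case.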

For all this I prefer what I call the "Dr. House" model. House is a TV show in which Dr. House draws out the root cause of the troublesome medical cases he and his team are involved with. I think it's a great model for triage and diagnosis of application performance. One of the axioms of House's interactions with patients is that "everybody lies" when they present their medical history.

This is how I approach triage - everybody lies (or in corporate-neutral language: inadvertently withholds key information). So I base most of my conclusions, as to how to proceed and what to look for, on what I can expose by comparing the current activity to 'normal' - or whatever baseline I can develop.

Wednesday, November 25, 2009

Yikes - project crashing!

My executive sponsor has bailed on the project. No details yet as to why. I am bummed. Sure, just need to sign up a fresh one but time marches on...

Tuesday, November 24, 2009

What You Should Know... part 2

Well, this part was less annoying.
  • Application lifecycle - includes development. Tru-dat my brother! Too bad the author's example is of a code-profiler. And yes, most APM-savvy folks do not include development as part of APMonitoring. But if you really want to improve app performance (and the end-user experience) - you need cooperation from development.
  • The SOA flag - This is not so bad. He inserts that ASM acronym again. But otherwise, this is accurate and helpful. You know, if you really think that APM is overloaded (which it is) - how about SPM - Service Performance Management. Everybody knows "ASM" means an assembler directive anyway!
And then a few other points - and the rest is an ad for a Microsoft product. And now all the spin makes sense - a desktop view of the world IT management problem. Not so innovative. Read at your own risk.

All in all, what can I say about this author? He has 30 books published. I have a book proposal. So I am crap.

But. He writes about APM. I "do" APM. I help clients realize APM. I "talk the talk" AND "walk the walk".

The author is "two-thousand and LATE"! ;-)

Now, if I only had a book...