Wednesday, December 16, 2009

Oblicore - Service Level Management

Notes from the public (registration required) online seminar, 16 December 2009, 11:00 AM EST:

Deming (father of quality control) - Plan, Do, Check (measure), Act

SLA can mean many things - ITIL, ERP, CRM, COBIT - folks think they are talking about the same thing, but the meaning varies depending on where they are in the organization.

"When you have no destination - Every road leads nowhere". You need to align business needs with targets.

- An SLA is a contract between a provider and a customer. Service Level Agreement, Operational Level Agreement, Underpinning Contract. They all document targets and specify responsibilities of the parties (a rough sketch of what gets documented follows below).

- SLM -- ITIL Service Level Management. Tells you what to do (but not necessarily how to do it)
- Define, document, agree, monitor, measure, report and review the level of IT services provided.
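
Not from the seminar - just my own rough sketch, in Python, of what an SLA record ends up documenting: the targets plus the responsibilities of each party. Every field name here is invented for illustration.

from dataclasses import dataclass, field

@dataclass
class ServiceLevelTarget:
    metric: str        # e.g. "availability_pct" or "response_time_ms"
    objective: float   # the agreed number, e.g. 99.5 or 2000
    comparison: str    # "min" -> measured value must stay >= objective
                       # "max" -> measured value must stay <= objective
    window: str        # measurement period, e.g. "monthly"

@dataclass
class ServiceLevelAgreement:
    service: str
    provider: str
    customer: str
    targets: list = field(default_factory=list)
    responsibilities: dict = field(default_factory=dict)  # party -> obligations

    def breached(self, metric: str, measured: float) -> bool:
        """True if the measured value misses the documented target."""
        for t in self.targets:
            if t.metric == metric:
                return (measured < t.objective if t.comparison == "min"
                        else measured > t.objective)
        return False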

Most companies manage SLAs manually. Each contract negotiated separately. Labor intensive data collection. Reporting is reactive.

- Establish - Implement - Manage - Review:

Service Improvement Plan: Define Strategy, Planning, Implementation, Catalog Services, Draft, Negotiate, Review SLAs, SCs and OLAs, Agree, Measure and Monitor, Report, Review SLM process, Review SLAs, SCs and OLAs (back to start of loop).

( lots about roles for the above steps...)

- Standardizing Contract Creation and Revision - Data collection and metrics need to be uniform.
-- most SLAs require multiple data sources. Need to aggregate the data, apply exceptions (TOD, external events), and correlate with other incidents.
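
To make that aggregation point concrete, here is a minimal sketch (mine, not Oblicore's) of rolling up availability samples from multiple sources while excluding the agreed exception windows - all the names are invented.

from datetime import datetime

def in_exception_window(ts: datetime, windows) -> bool:
    """windows: list of (start, end) datetimes agreed as exceptions."""
    return any(start <= ts <= end for start, end in windows)

def sla_availability(samples, exception_windows) -> float:
    """samples: list of (timestamp, is_up) tuples merged from all sources."""
    counted = [(ts, up) for ts, up in samples
               if not in_exception_window(ts, exception_windows)]
    if not counted:
        return 100.0
    up_count = sum(1 for _, up in counted if up)
    return 100.0 * up_count / len(counted)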

(more on types of pain when a robust process for SLA management is absent)

Oblicore does the last mile to drive an automated ITIL approach, via a double closed-loop process. (Geeze - this is buzzword heavy!).

(demo of Oblicore Guarantee) UI is browser-based, multiple tabs. Basically workflow management with a federated view of multiple metrics, over various web-based forms, following the ITIL model. Generates paper! But somebody has to sign this stuff - so, way better than Excel... They use adapters to bring in various metrics. Looks like an ETL transform (Table/Field assignment). Don't know how real-time this could be.
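
My guess at the shape of one of those adapters - a plain field-mapping transform, not their actual implementation. The source fields and the "helpdesk" tag are made up.

FIELD_MAP = {                     # source field -> canonical field (invented)
    "TICKET_OPENED": "start_time",
    "TICKET_CLOSED": "end_time",
    "SEV": "severity",
}

def transform(row: dict) -> dict:
    """ETL-style transform of one source row into the canonical record."""
    record = {canon: row.get(src) for src, canon in FIELD_MAP.items()}
    record["source"] = "helpdesk"   # tag where the metric came from
    return record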

Friday, December 4, 2009

Storm Clouds are Clearing

Well the word is that we are refitted with an appropriate executive sponsor and a status call is scheduled for Monday 7 December. I'm still expecting a few bumps in the road.

Review - Symantec I3 - A Performance Management Methodology

[Clearing out the Drafts folder (Dec 4)]
http://silos-connect.com/solutions/i3methodologybook.pdf

This has a 2005 copyright - so I'm not expecting to get anything too useful here. But it does show up when you are searching on "APM Methodologies".

Introduction - Poor app performance translates into poor end-user experience.

Chapter 1 - Defines incidents as "performance red zones" - where the system fails to meet performance goals - and suggests that these may also be used to define SLAs. They break down incidents into four major types, and I've added some more conventional definitions of what they had in mind:
  1. Design and Deployment Problems (Design, Code, Build, Test, Package, Deploy)
  2. System Change Problems (Configuration mistakes due to tuning or attempted fix)
  3. Environment Change Problems (unexpected load or usage patterns)
  4. Usage Problems (Prioritized Access to Shared IT Resources)
The summary of that section presents that:
"Performance management comprises three core activities, reactive, proactive and preventative. Symantic i3 provides the techniques and the tools that are necessary to carry out these activities using a structured, methodical, and holistic approach."
As much as the core activities are accurate, there is nothing in chapter 1 that supports that summary.

Chapter 2 - provides the substance for the core activities described in the summary from chapter 1.
  • Reactive Management - maximize alertness and minimize time-to-detection of problems, equip staff with the right tools to minimize time-to-resolution. I guess lots of caffeine is one of the tools!
  • Proactive Management - "close monitoring" will often detect that problems are likely to occur. I guess you need to be in the same room as the monitoring...
  • Preventative Management - Minimize performance problems in the first place by employing periodic health checks and resolve problems before they get out of hand.
  • Hybrid Approach - Combine all (3) management styles, with feed-forward and feed-back, leading to an overall decline in firefighting situations.
I don't mind where they ended up (hybrid approach) but it seems very operations-centric, which is unusual because the tool itself is developer-focused - and I've never met anyone who could use i3 in operations. But phrases like "maximize alertness" and "close monitoring" suggest that the contract writer didn't know that much more about monitoring than the marketing dude who approved this whitepaper.

Reactive Management IS jumping to action when an incident occurs. But the call to action comes from an alert, and the problem with alerts is that they very often don't bubble up to a Trouble Management console until some 15-30 minutes after the incident started. So somehow being closer to the keyboard doesn't really add any value - you are still the last to know. When your helpdesk is a more timely indicator of application health than your monitoring solution - that is when you know you have a major visibility deficit.

Likewise, defining proactive as more frequent monitoring, in order to close that 15-30 minute gap, is also a disservice. The operations team is simply not in a position to be constantly reviewing the key metrics for hundreds of applications under their watch. They rely on alerts to filter out the application noise (all those squiggly traces) and indicate which app is having problems. It might surprise you to also realize that operations has exactly two responses to an alert: 1. Restart the app, or 2. Open a bridge call, and restart the app. That's 95% of modern IT today and that's the real problem.

I define proactive management as preventing the problem from getting deployed in the first place - not simply responding to an alert more quickly. The writer assigns this instead to preventative management, so maybe it is only a semantic (funny, no?) difference. But if they mean the periodic health checks as something that is occurring in operations, then this is a fantasy. Reverse-engineering the normal performance characteristics from an operational application is a massive task. Remember, we are not talking about the home web server - we are talking about a commercial, enterprise environment with hundreds of applications. That's really not a role for operations to undertake and decide. To make the problem tractable, the app owners have to pass judgment on what is normal, and what is abnormal, for their app. In reality, they are the only ones who have a chance at understanding it.
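
To give a feel for what that reverse-engineering involves, even a crude per-metric baseline looks something like this - a sketch only, and the percentile choices are arbitrary assumptions the app owner would still have to bless.

import statistics

def baseline(history):
    """history: one metric's samples over a representative period."""
    ordered = sorted(history)
    p50 = ordered[len(ordered) // 2]
    p95 = ordered[int(len(ordered) * 0.95) - 1]
    return {
        "typical": p50,
        "upper_normal": p95,              # candidate alert threshold
        "stdev": statistics.pstdev(ordered),
    }

Multiply that by every key metric for every one of those hundreds of applications and the scale of the task becomes obvious.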

Regarding the hybrid approach, that's something useful, but only if we are feeding operational information back to Dev and QA in order to improve the initial testing of the app. And feeding forward results from QA, Stress and Performance testing that can be used to set initial alert thresholds, plus maybe some info as to what known problems might actually look like - hopefully in terms of run documentation and/or dashboards.

The dashboard metaphor is key because no one has any time to figure out where the logs are for a given app. You need a mechanism to present details about the app, along with a list of who to call and what some of the resources involved might be.
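
Putting the feed-forward and the dashboard ideas together, the hand-off I have in mind is something like the record below - initial thresholds from the QA/stress runs, who to call, and where the resources and logs live. Every name and number is invented for illustration.

# Sketch of a per-application hand-off record: fed forward from QA/stress
# testing, surfaced on a dashboard for operations. All values are invented.
APP_RUNBOOK = {
    "app": "order-entry",
    "owner": "app team pager / distribution list goes here",
    "initial_thresholds": {               # derived from stress-test percentiles
        "login_response_ms":    {"warn": 1500, "alert": 3000},
        "checkout_response_ms": {"warn": 2500, "alert": 5000},
    },
    "known_problem_signatures": [
        "connection pool exhaustion: response times climb while DB CPU stays flat",
        "GC thrash: periodic stalls, heap sawtooth flattens out",
    ],
    "resources": ["order_db", "payment_gateway", "jms_order_queue"],
    "logs": "location of app and container logs goes here",
}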

Chapter 3 - Symantec Methodology - Process and Tools. Oops - I prefer People, Process and Product (tools). Must be some Borg over there at Symantec... And they claim the effectiveness of their "proven process" - but they haven't actually defined it yet... and then we are on to the products Insight, Indepth and Inform. And that's all for that section.

Performance Management Stages include detect (symptoms), find (source), focus (root-cause), improve (follow steps to improve) and verify (verify). I call that Triage and Remediation. My first step is not detect but refute! I prefer a less adversarial approach by first intoning "how do we know we have a problem?" It sometimes makes for some uncomfortable moments of silence.

Think about it. Someone notifies you that there is an urgent issue. How do you know that they are accurate? What do you look at to confirm that the issue is actually an incident? This is an important bit of process because the stages of explanation can become something consistent, like a practiced drill that can be executed when stress is high and patience is short. When everybody is used to it, it is actually quicker to review what's right and what is apparently wrong. And you need a steady soul to initiate this and keep it on track - but that's what effective triage is really about - deliberate, conclusive steps until something is found out of place. Not as much fun as running around with your hair on fire but a whole lot more effective and predictable.

One of the big complaints I hear from operations is that alerts are dropping in all the time. It's not just too many alerts and their frequency. It often means too many alerts for which nothing could be found indicating a problem. False-positives. Not actionable. Ultimately these are due to defects in visibility - and something that no amount of pressure and screaming can resolve. And there is an important point skipped over - the nature of alerts. Most alerts are that a system is unavailable - it has gone down or is unreachable. The supposed "pro-active" alerts - these are different because they have thresholds defined. It's not unfair to suggest that alerting on a threshold is more visibility than a "gone-down" alert - but it certainly isn't "proactive" - it's using a threshold to define the alert. Duh! But how do you arrive at the threshold? What metrics do you select?
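
The difference between the two alert styles is simple enough to write down - the hard part, which the whitepaper skips, is where the number comes from. The 3000 ms below is a placeholder, not a recommendation.

def availability_alert(is_responding: bool) -> bool:
    # a "gone-down" alert: fires only after the component has already failed
    return not is_responding

def threshold_alert(response_ms: float, threshold_ms: float = 3000.0) -> bool:
    # picking threshold_ms is the baseline/feed-forward work described above
    return response_ms > threshold_ms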

As the writer revisits the different management styles, they point out that the mechanisms to realize the process improvements are part of the product (tools). Well, that's cool, but what I really like to focus on are the product-neutral processes - the things I should be doing no matter whose tool set is in play. Sure, it's nice that you have an automated mechanism to periodically do performance reviews. But what do you look at? What metrics are important? What changes are significant? How does this relate to what the operations team needs in order to be more effective?

Process is something that is easy to wave around. Automated processes sound even better. What the process is and how it relates to the current organization and capabilities - I don't think the Symantec tool has any concept of that. And that is the gap that limits adoption and, ultimately, the utility of the tool.

In summary, the Symantec methodology is detect, find, focus, improve, and verify. The different products (tools) implement and automate these processes. I guess that could be anything.

Section 2 - devotes a chapter (chapters 4-8) to each step of the methodology. Performance Reviews are the mechanism: the periodic health check. This requires (4) types of reports: Top-N (response times), Trends and Exceptions. Apparently, there are only (3)! I guess the Top-N by invocations report is implied. The trends are just a historical view of a metric - no magic here! The exceptions are an historical view of exceptions and errors - also not magical. There should be a status or overview of the environment - again, no magic needed.
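
None of these reports needs magic - a Top-N by average response time is a few lines (a sketch only; the data shapes are invented).

from collections import defaultdict

def top_n(samples, key="response_ms", n=10):
    """samples: list of dicts like {"txn": "login", "response_ms": 840}."""
    totals, counts = defaultdict(float), defaultdict(int)
    for s in samples:
        totals[s["txn"]] += s[key]
        counts[s["txn"]] += 1
    averages = {txn: totals[txn] / counts[txn] for txn in totals}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:n]

Swap the averaging for a simple count per transaction and you have the implied Top-N by invocations; the trend and exception reports are just the same metrics plotted over time.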

As the author moves back to Reactive Management, they introduce the concept of resource monitoring - databases, messaging, etc. These need specialized views (or tools) - no surprise here - but can't we be "proactive" for resource management as well?