Tuesday, September 22, 2009

Funding is confirmed

While off on vacation I received an email that funding was to be approved. I'll get the details when I get back to work next week. I immediately sent off a notice to the editor candidate, as we still have to do a final interview. I think everything will be fine, but you just can't know until you share the plan, goals, and concerns.

Saw some corporate messaging around PCMM - People Capability Maturity Model - as part of an HR initiative. Maybe influenced by the earlier ICMM - Instrumentation Capability and Maturity Model... but that might be too bold an assertion. Still, it's nice to see some parallel efforts to document the skills, processes, and competencies appropriate for a body of experience.

Before vacation, I got pulled into a firefight over a stability/scalability issue. No QA testing was possible, so we went into production with a default configuration. With the holiday weekend, we got 5 days of data before the issue was reproduced. This is an excellent scenario: the application team was unaware that monitoring had been deployed so far in advance, so they immediately suggested that "probably the monitoring was the culprit...". The bridge call got quiet quickly when they learned that we had already been running without incident for 5 days. Then the monitoring team dug right into the heart of the matter - the load had tripled. Was this something that was tested? More silence on the call. Simply put, the app cannot scale to the load that was delivered.

We also had a CritSit team from another major vendor who were proposing to get some instrumentation in place, check logs, etc. - they would need about a week. While that monologue unfolded, the monitoring team generated a report with the bulk of that information and circulated it via email. The CritSit team scurried to open the PDF, we reviewed it a bit, and then they realized that we already had everything they were proposing. "So can we all agree to start looking for the bottleneck?" For me, that means it's time to come off site. The job is done and vacation awaits!

At the onset of this engagement, there was considerable pressure to deploy all kinds of functionality. Sales is anxious to sell something - which is always the case - and wants to use the firefight to showcase the product. But this is extraordinarily risky without some time (a few hours) to test those configurations in the client environment. So you go with the smallest configuration that offers reasonable visibility - I'm only interested in getting visibility into the nature of the performance problem. In this case, we could only manage some manual exercise of the application while the testing environment itself was having problems. We got about 500 metrics, which is much lower than expected for this kind of application but enough to verify that the major entry points were being seen and that transaction traces had good visibility. On initial deployment to production, this increased to 2,500 metrics. But this was still too low, as the bulk of usage had already transpired for the operational day. After 24 hours we had increased to about 5,000 metrics, and this remained consistent over the next few days.
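As a purely illustrative aside, that visibility check - are the major entry points showing up at all, and is the metric count growing as traffic accrues? - reduces to a few lines once you have a flat export of metric names. The file name, the delimiter, and the entry-point prefixes below are assumptions for the sketch, not the actual engagement data or any particular product's format.

```python
"""Illustrative visibility check: count metrics per entry point from a flat
export of metric names (one name per line). File name, delimiter, and the
entry-point prefixes are hypothetical, not tied to any specific APM product."""
from collections import Counter

EXPECTED_ENTRY_POINTS = ["Servlets", "JSP", "EJB", "JDBC"]  # hypothetical prefixes


def summarize(metric_file):
    counts = Counter()
    with open(metric_file) as fh:
        for line in fh:
            # Assume metric names look like "Servlets|MyServlet|Response Time".
            prefix = line.strip().split("|", 1)[0]
            counts[prefix] += 1

    print("total metrics:", sum(counts.values()))
    for ep in EXPECTED_ENTRY_POINTS:
        status = "OK" if counts.get(ep, 0) > 0 else "MISSING"
        print("%-10s %5d  %s" % (ep, counts.get(ep, 0), status))


if __name__ == "__main__":
    summarize("metric_names.txt")
```

Run daily, the totals tell the same story as above: a few hundred metrics under manual exercise, a few thousand once production traffic has had a full operational day.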

When the problem is not easily reproduced, and QA is not useful, you need to focus on helping the monitoring team understand what 'normal' is for the application. When I first look at an application, I have a lot of experience to bring to bear from similar apps, so I have an idea of what is correct and what is problematic. This becomes more accurate as operational visibility accrues. I'll build the initial report based on my findings from the QA exercise, but I'll shortly have the monitoring team do the mechanics of generation. For the first day, we have them transfer the metrics repository so I can (working remotely) generate the report AND confirm that we are getting the major points of interest. I then give them the report template (part of a management module - a jar file) and have them generate the report. Transferring the metrics archive (via ftp) can take 30-60 minutes and a little work. Generating the report takes a couple of minutes, and the result can be sent via email. As we transfer responsibility, the workload actually decreases, so push back is never a problem here.
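To make the hand-off concrete, here is a rough sketch of what the daily mechanics look like once the monitoring team owns the report: run the generator, mail the PDF. The report command, template name, file paths, and addresses are all placeholders for whatever your APM product and site actually provide; this is only meant to show how small the daily cycle becomes.

```python
#!/usr/bin/env python
"""Illustrative daily cycle: generate the 24-hour report and mail it out.

The report command, template, paths, and addresses are hypothetical
placeholders -- substitute your APM product's report generator and your
site's mail relay.
"""
import smtplib
import subprocess
from datetime import date
from email.message import EmailMessage

REPORT_CMD = ["generate_report", "--template", "daily_baseline.jar",
              "--last-hours", "24"]            # hypothetical report generator
REPORT_PDF = "/tmp/daily_baseline_%s.pdf" % date.today().isoformat()
RECIPIENTS = ["bridge-call@example.com"]       # placeholder distribution list


def main():
    # Run the (hypothetical) report generator; this takes a couple of minutes.
    subprocess.run(REPORT_CMD + ["--output", REPORT_PDF], check=True)

    # Attach the PDF and send it to the bridge-call distribution list.
    msg = EmailMessage()
    msg["Subject"] = "APM 24-hour baseline report - %s" % date.today()
    msg["From"] = "monitoring-team@example.com"
    msg["To"] = ", ".join(RECIPIENTS)
    msg.set_content("Daily baseline report attached; review on the bridge call.")
    with open(REPORT_PDF, "rb") as fh:
        msg.add_attachment(fh.read(), maintype="application", subtype="pdf",
                           filename=REPORT_PDF.rsplit("/", 1)[-1])
    with smtplib.SMTP("localhost") as smtp:    # assumes a local mail relay
        smtp.send_message(msg)


if __name__ == "__main__":
    main()
```

Compare that with a 30-60 minute ftp transfer of the metrics archive and it is clear why the monitoring team never pushes back on taking ownership.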

We go through the report each day, illustrating how to present the information in a neutral fashion, indicating what appears normal or abnormal. After 2-3 sessions, the monitoring team will begin to review the report on the bridge call. I'm just there for air cover. The goal is to get everyone looking at the same information (the report) and understanding 'normal'. When the event is encountered again, the practice of the report review directly contrasts the 'incident' with 'normal'.

Very often the bridge call is little more than a shouting match. Participants have access to their own data and information about the nature of the incident and "orally" share this on the bridge. There is a lot of subjective reasoning that goes on, which can be a distraction and ultimately counterproductive. Everybody wants to add value, but the conclusions are not always consistent. Being able to quickly share the baseline of application performance simply puts everyone on the same page. It also addresses a major gap with bridge calls - no access to historical information. Simply having a 24-hour report, every day, is enough to correct this shortcoming. Initially you will have to spend a bit of time educating the team on how to interpret the metrics and explaining some language and terms - but most teams come up to speed quickly.

Of course you might think that getting everyone access to the workstation displays of performance information would be even better. However, most organizations are simply not yet sharing information, even a simple baseline performance report. You have to establish this process and skill level first before you introduce the richness of the workstation display. Otherwise, the team will get distracted, confused, or frustrated - and they will never even get to the baseline information.

Tuesday, September 1, 2009

Awaiting Funding

The last month was a bit tedious, but I was earlier invited to propose a book outline that covers all of my exploits and learning while assisting clients in their realization of application performance management. This blog will cover the developments leading to the publication of the book and the feedback on the book once it is released to the public at large. I am optimistic that this will happen, even as a small issue of funding remains in limbo.

I have been at this crossroads a few times before. While IT organizations, in general, would benefit from a thorough discussion of this topic, product management is not fully on board. A good part of this is because the techniques are largely vendor-neutral. This is important because the marketplace has matured and only a few players remain. Of those players, only one is in growth mode and continuing to invest. It is believed, by some, that releasing these best practices and this understanding of the marketplace will give the ailing competitors a second chance. My view is that by educating our potential customers, and helping them utilize their existing investments (even with competing tools), we will increase their pace of adoption and they will naturally choose to partner with the stronger player. Fortunately, sales already knows that this approach will accelerate adoption, and thus license revenue. Getting these techniques into book form simply scales the number of interactions we can support. Who will win out?

The book will be based on my library of presentations and discussion papers which, to date, has been limited to client engagements and internal training. Public notice of this body of work is limited to a single event, in November 2007, where a reporter wrote an unauthorized story about a presentation by one of my clients in which they discussed how they had established a Performance Center of Excellence and were practicing true, proactive management of their application lifecycle. That's a lot of buzzwords, but given that it was done largely with their own staff, and had already demonstrated value, it came as a bit of a shock, compared to the much longer periods that such IT initiatives need in order to show some success. There were some services, of course. I led that initiative and a number of other similar programs, in what we call the "Mentoring Model". I've seen things...

Anyway, this single event launched a number of sales initiatives, based on the concept of quickly building client teams and following our APM Best Practices, to dramatically increase the pace of adoption of APM tools, and the value realized when employing those tools. This resulted in a remarkable quantity of deals, each in excess of $10M in new product revenue. None of these deals had services attached to them, which is both annoying (I'm in professional services) and illuminating. We will still do some services with these clients, but it will be of very high value and limited duration. Not at all like the large staffing and long duration of traditional IT initiatives.

What is this APM marketplace? Why does it warrant such investment? What kind of value does it return? How do you get started? What are these APM best practices? How do you build your own team? These are all topics which the book will present and discuss. It is a topic that I believe the world needs to know about. But I am largely constrained by the proprietary and confidential nature of my work - so my employer's commissioning of this book is an essential piece of the puzzle.

Historically, in my career in the commercial software business, I have been at this precipice three times already: holding a large body of internal-only materials while the marketplace starves for guidance, direction, and illumination. Is the fourth time the charm? One of my mentors suggested, a few years back, that no one can fail when they listen to the customer and find a way to give them what they need. I've got it all together this time. I've kept the faith. But I really need to see it happen... once!

Now if I can only get the funding to bring it to light.