Tuesday, September 22, 2009

Funding is confirmed

While off on vacation, I received an email that funding was to be approved. I'll get the details when I get back to work next week. I immediately sent off a notice to the editor candidate, as we still have to do a final interview. I think everything will be fine, but you just can't know until you share the plan, goals and concerns.

Saw some corporate messaging around PCMM - the People Capability Maturity Model - as part of an HR initiative. Maybe it was influenced by the earlier ICMM - Instrumentation Capability and Maturity Model... but that might be too bold an assertion. Still, it's nice to see some parallel efforts to document the skills, processes and competencies appropriate for a body of experience.

Before vacation, I got pulled into a firefight over a stability/scalability issue. No QA testing was possible, so we went into production with a default configuration. With the holiday weekend, we got 5 days of data before the issue was reproduced. This turned out to be an excellent scenario: the application team was unaware that monitoring had been deployed so far in advance, so they immediately suggested that "probably the monitoring was the culprit...". The bridge call got quiet quickly when they learned that we had already been running without incident for 5 days. Then the monitoring team dug right into the heart of the matter - the load had tripled. Was this something that was tested? More silence on the call. Simply put, the app cannot scale to the load that was delivered.

We also had a CritSit team from another major vendor who were proposing to get some instrumentation in place, check logs, etc. - they would need about a week. While that monologue unfolded, the monitoring team generated a report with the bulk of that information and circulated it via email. The CritSit team scurried to open the PDF, we reviewed it a bit, and then they realized that we already had everything they were proposing. "So can we all agree to start looking for the bottleneck?" For me, that means it's time to come off site. The job is done and vacation awaits!

At the onset of this engagement, there was considerable pressure to deploy all kinds of functionality. Sales is anxious to sell something - which is always the case - and wants to use the firefight to showcase the product. But this is extraordinarily risky without some time (a few hours) to test those configurations in the client environment. So you go with the smallest configuration that offers reasonable visibility - I'm only interested in getting visibility into the nature of the performance problem. In this case, we could only manage some manual exercise of the application while the testing environment itself was having problems. We got about 500 metrics, which is much lower than expected for this kind of application but enough to verify that the major entry points were being seen and transaction traces had good visibility. On initial deployment to production, this increased to 2,500 metrics. But this was still too low, as the bulk of usage had already transpired for the operational day. After 24 hours we had increased to about 5,000 metrics, and this remained consistent over the next few days.
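Just to illustrate the kind of sanity check I mean, here is a minimal sketch, assuming the agent's metric catalog has been dumped to a plain text file with one metric path per line (the file name and the entry-point names are made up for the example), that counts the metrics and confirms the major entry points are actually visible:

```python
# Minimal sketch: sanity-check an exported metric list (hypothetical file
# "metrics_export.txt", one metric path per line) to confirm the agent is
# seeing the major entry points before go-live.

EXPECTED_ENTRY_POINTS = ["Servlets", "EJB", "JDBC"]  # assumed names, for illustration only

def check_visibility(path="metrics_export.txt"):
    with open(path) as f:
        metrics = [line.strip() for line in f if line.strip()]

    print(f"Total metrics reported: {len(metrics)}")
    for entry in EXPECTED_ENTRY_POINTS:
        hits = sum(1 for m in metrics if entry in m)
        status = "OK" if hits else "MISSING"
        print(f"  {entry:10s} {hits:5d} metrics  [{status}]")

if __name__ == "__main__":
    check_visibility()
```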

When the problem is not easily reproduced, and QA is not useful, you need to focus on helping the monitoring team understand what 'normal' is for the application. When I first look at an application, I have a lot of experience with similar apps to bring to bear, so I have an idea of what is correct and what is problematic. This becomes more accurate as operational visibility accrues. I'll build the initial report based on my findings from the QA exercise, but I'll shortly have the monitoring team do the mechanics of generation. For the first day, we will have them transfer the metrics repository, so I can (working remotely) generate the report AND confirm that we are getting the major points of interest. I then give them the report template (part of a management module - a jar file) and have them generate the report themselves. Transferring the metrics archive (via ftp) can take 30-60 minutes and a little work. Generating the report takes a couple of minutes, and it can be sent via email. As we transfer responsibility, the workload actually decreases, so push-back is never a problem here.
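The handoff itself is easy to script. This is only a sketch of the idea - the report command, file names and mail settings are placeholders, not the actual product tooling - but it shows why the daily workload drops once the team generates the report locally and emails the PDF instead of shipping the whole metrics archive:

```python
# Sketch of the daily handoff: generate the 24-hour report locally and
# email the PDF. The report invocation, jar name and SMTP details are
# placeholders for whatever tooling the site actually has.

import smtplib
import subprocess
from email.message import EmailMessage

REPORT_CMD = ["java", "-jar", "report-template.jar", "--last", "24h",
              "--out", "daily_report.pdf"]          # placeholder invocation
RECIPIENTS = ["bridge-call@example.com"]            # placeholder list

def generate_and_send():
    # Generating the report takes a couple of minutes...
    subprocess.run(REPORT_CMD, check=True)

    msg = EmailMessage()
    msg["Subject"] = "Daily 24-hour performance baseline report"
    msg["From"] = "monitoring-team@example.com"
    msg["To"] = ", ".join(RECIPIENTS)
    msg.set_content("Attached is today's baseline report for the bridge call.")
    with open("daily_report.pdf", "rb") as f:
        msg.add_attachment(f.read(), maintype="application",
                           subtype="pdf", filename="daily_report.pdf")

    # ...versus 30-60 minutes to ftp the full metrics archive.
    with smtplib.SMTP("mail.example.com") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    generate_and_send()
```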

We will go through the report each day, illustrating how to present the information in a neutral fashion and indicating what appears normal or abnormal. After 2-3 sessions, the monitoring team will begin to review the report on the bridge call themselves; I'm just there for air cover. The goal is to get everyone looking at the same information (the report) and understanding 'normal'. When the event is encountered again, the practice of the report review directly contrasts the 'incident' with 'normal'.
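That 'incident' versus 'normal' contrast can even be made mechanical. Here is a rough sketch, assuming the daily summaries are kept as simple CSV files (the column names and the 3-sigma threshold are my own assumptions, not the report's actual format), that flags anything well outside its usual range:

```python
# Sketch of contrasting an incident day against the accumulated baseline.
# Assumes daily summaries saved as CSV files with columns "metric" and
# "avg_response_ms" -- the layout and the 3-sigma threshold are assumptions
# for illustration.

import csv
import statistics
from collections import defaultdict

def load_day(path):
    with open(path) as f:
        return {row["metric"]: float(row["avg_response_ms"])
                for row in csv.DictReader(f)}

def flag_abnormal(baseline_files, incident_file, sigmas=3.0):
    history = defaultdict(list)
    for path in baseline_files:
        for metric, value in load_day(path).items():
            history[metric].append(value)

    for metric, value in load_day(incident_file).items():
        samples = history.get(metric)
        if not samples or len(samples) < 2:
            continue  # not enough history to call anything abnormal
        mean = statistics.mean(samples)
        stdev = statistics.stdev(samples)
        if stdev and abs(value - mean) > sigmas * stdev:
            print(f"{metric}: {value:.0f} ms vs normal {mean:.0f} ms "
                  f"(+/-{stdev:.0f}) -- outside the usual range")

# Example: five days of 'normal' against today's incident summary.
# flag_abnormal([f"day{i}.csv" for i in range(1, 6)], "incident.csv")
```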

Very often the bridge call is little more than a shouting match. Participants have access to their own data and information about the nature of the incident and "orally" share this on the bridge. There is a lot of subjective reasoning that goes on, which can be a distraction and ultimately counterproductive. Everybody wants to add value, but the conclusions are not always consistent. Being able to quickly share a baseline of application performance simply puts everyone on the same page. It also addresses a major gap with bridge calls - no access to historical information. Simply having a 24-hour report, every day, is enough to change this shortcoming. Initially you will have to spend a bit of time educating the team on how to interpret the metrics and explaining some language and terms - but most teams come up to speed quickly.

Of course you might think that getting everyone access to the workstation displays of performance information would be even better. However, most organizations are simply not yet sharing information, even a simple baseline performance report. You have to establish this process and skill level first before you introduce the richness of the workstation display. Otherwise, the team will get distracted, confused, or frustrated - and they will never even get to the baseline information.
