by Michael Procopio, Product Manager, BAC
In advanced analytics, two areas of interest to IT management are detecting a problem and isolating it. Previously I wrote "Advanced analytics reduces downtime costs – detection"; in this post I’ll cover isolation.
In the previous post, I covered how advanced analytics finds an anomaly, potentially before a threshold is crossed.
Problem Isolation is the process of determining which component in the infrastructure is causing the problem* or incident* that we found. We will presume we are monitoring the service that is having the issue.
If you had no management tools (amazingly, I have spoken to customers in this situation), the way to find a problem would be to log in to each system, router, switch and potentially application (ex: Oracle), look at each one with whatever tools are available (ex: Windows Perfmon) and hope you find it. If you are interested in advanced analytics, this is probably not your situation.
The more typical case is that you have multiple management tools: network, system, virtualization, database and perhaps others. So if you know which domain the problem is in, you have a good place to start. I’ve listened to podcasts and read reports that bring up a few problems with this (if you know of any good IT podcasts, please send them along):
- ~80% of problems are sent to the network team with only ~20% being network issues
- ~60% of problems take >10 experts to resolve
- ~80% of the time to restore service is spent isolating the problem
Here is an analogy I use with my non-IT friends to explain why this area is needed. You are monitoring the speed of a car going across the country (pick your favorite country). You are separately monitoring the infrastructure, all:
- roads
- bridges
- ferries to take cars across the water
What you don’t know is where the car is (old car, no GPS). You are getting many alerts from the roads, bridges and ferries. Which one is affecting the car? Since you don’t know what road the car is on you don’t know if any given alert is the one affecting your car.
This is where the CMDB comes into the isolation process. The CMDB has the route the car is taking or, in our case, the items in the IT infrastructure that make up the service that has the problem.
Part one of the isolation process is to restrict what we are looking at to the relevant IT items. This greatly reduces the computational power required. For example, one customer I recently visited told me he has 2000+ servers. If we can reduce that to a few app servers and a few database servers (isn’t SOA wonderful for us operations types?), that is a reduction by a factor of ~200.
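Here is a rough sketch of what that scoping step could look like. It assumes the CMDB can be exported as a simple "depends on" map; the item names, the map and the alert format are all made up for illustration and are not how any particular product stores them.

```python
from collections import deque

# Hypothetical CMDB export: each item maps to the items it depends on.
DEPENDS_ON = {
    "order-service": ["app-01", "app-02"],
    "app-01": ["db-01"],
    "app-02": ["db-01"],
    "db-01": ["san-01"],
    "payroll-service": ["app-07"],
    "app-07": ["db-03"],
}


def items_for_service(service, depends_on):
    """Return every item reachable from the service (breadth-first walk)."""
    seen = set()
    queue = deque([service])
    while queue:
        node = queue.popleft()
        for child in depends_on.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def relevant_alerts(alerts, service, depends_on):
    """Keep only the alerts raised by items the service depends on."""
    scope = items_for_service(service, depends_on)
    return [alert for alert in alerts if alert["item"] in scope]


# Made-up alerts from three different management tools.
ALERTS = [
    {"item": "db-01", "text": "lock wait time high"},
    {"item": "db-03", "text": "disk 90% full"},
    {"item": "router-12", "text": "interface flapping"},
]

SCOPED = relevant_alerts(ALERTS, "order-service", DEPENDS_ON)
```

In the car analogy, `items_for_service` is the route and `relevant_alerts` throws away the alerts from roads the car is not on.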
Part two of the isolation is the heavy math from HP Labs, with more patent filings. It is a form of regression analysis, where application or end user response time monitoring is the dependent variable and all the infrastructure metrics are independent variables. In plain terms, if end user response gets worse, find the infrastructure metrics that get worse. When end user response gets better, find the metrics that get better. The more closely an infrastructure metric tracks the end user response, the more likely it is to be the cause.
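To make the "tracks the end user response" idea concrete, here is a minimal sketch that ranks metrics by plain Pearson correlation with response time. This is a simple stand-in for illustration, not the HP Labs method, and the metric names and sample values are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_metrics(response_times, metrics):
    """Sort metrics by how closely they track response time, strongest first."""
    scores = {
        name: pearson(series, response_times)
        for name, series in metrics.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented samples taken at the same seven points in time.
RESPONSE = [1.0, 1.1, 1.0, 2.5, 3.0, 2.8, 1.2]  # end user response (s)
METRICS = {
    "db-01 lock wait (ms)": [5, 6, 5, 40, 55, 50, 8],
    "app-01 CPU (%)": [30, 35, 28, 33, 31, 36, 29],
    "san-01 queue depth": [2, 2, 2, 2, 2, 2, 2],
}

RANKED = rank_metrics(RESPONSE, METRICS)
for name, score in RANKED:
    print(f"{score:+.2f}  {name}")
```

The database metric rises and falls with response time, so it lands at the top of the list; the flat metric cannot explain anything and scores zero. Correlation alone does not prove cause, which is why it is only one part of the isolation process.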
Again, while the math is interesting, pictures work better for me.
The thick grey line is the end user response, the red-purple line is the most closely correlated metric -- in this case a database metric. Just so you don’t have to strain your eyes, we provide a table like this (from a different problem) showing the weighted correlation scores.

Isolation part 3 is to include non-time series data. In the screen capture below you see planned changes and incident details (think alerts) on the timeline. Unplanned changes can also be displayed. Changes are pulled from the CMDB and incidents can come from any management system that can send alerts. Since we know that most problems result from changes, that is an important component. Finally, tickets from the helpdesk are included on the timeline, for the case where users are doing the monitoring.
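One simple way to use that timeline is to list everything that happened shortly before response time started to degrade. The sketch below assumes changes, incidents and tickets have been merged into one list of events; the event fields, the sample data and the two-hour window are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def events_before(events, onset, window):
    """Return events that fall in [onset - window, onset], newest first."""
    start = onset - window
    hits = [event for event in events if start <= event["time"] <= onset]
    return sorted(hits, key=lambda event: event["time"], reverse=True)


# Made-up timeline entries from the CMDB, monitoring tools and helpdesk.
EVENTS = [
    {"time": datetime(2009, 5, 8, 9, 0), "kind": "planned change",
     "text": "patch app-02"},
    {"time": datetime(2009, 5, 8, 13, 15), "kind": "unplanned change",
     "text": "db-01 parameter edited"},
    {"time": datetime(2009, 5, 8, 13, 50), "kind": "incident",
     "text": "db-01 lock wait time high"},
    {"time": datetime(2009, 5, 8, 14, 30), "kind": "ticket",
     "text": "orders page slow"},
]

ONSET = datetime(2009, 5, 8, 14, 0)  # when response time began to degrade
SUSPECTS = events_before(EVENTS, ONSET, timedelta(hours=2))
```

Here the morning patch falls outside the window and the helpdesk ticket arrives after the onset, so the two items left to look at first are the unplanned database change and the incident that followed it.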

Altogether, this automates a number of things operations teams already do and adds some math to help isolate problems.
*Incident and problem are ITIL terms. There may be many incidents that are symptoms of an underlying problem.
Since I asked for podcasts, here are some I listen to:
Posted 05-08-2009 3:40 PM by Michael_Procopio