by Michael Procopio
In the world of advanced analytics, two areas of interest to IT management are detecting a problem and isolating it. In this post I'll cover detection.
Problem detection, typically called anomaly detection in analytics circles, started in a very basic way: take a metric, say CPU utilization, set a threshold for it, and any time the threshold is crossed, we have an anomaly.
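To make the basic approach concrete, here is a minimal sketch of a fixed-threshold check. The function name, the sample values, and the 90% threshold are all made up for illustration.

```python
def threshold_anomalies(samples, threshold):
    """Return the indices of samples that exceed a fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


# Hypothetical CPU utilization samples, in percent.
cpu_utilization = [35.0, 42.0, 91.0, 40.0, 88.0, 95.0]

# Every sample above the threshold is reported, including one-sample spikes.
print(threshold_anomalies(cpu_utilization, threshold=90.0))  # prints [2, 5]
```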
This, of course, has many problems:
· How do I know where to set the threshold?
· The right level may be different at different times.
· If there is a one-sample spike above the threshold, is that really an anomaly I care about? (For some it is, but not for most, in my experience.)
The next step in setting thresholds was using standard deviation (STD). I create a sleeve of upper and lower bounds that covers a major percentage of the situations I have measured (+/- 1 STD covers about 68% of samples when the data is roughly normally distributed) and use that. This has some of the same problems as above. However, let’s focus on the time period problem.
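A standard deviation sleeve could be sketched roughly as follows. The function names and the history values are hypothetical, and the multiplier k (how many standard deviations wide the sleeve is) is an assumption you would tune.

```python
import statistics


def std_sleeve(history, k=1.0):
    """Return (lower, upper) bounds at mean +/- k population standard deviations."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    return mean - k * std, mean + k * std


def outside_sleeve(value, sleeve):
    """True if the value falls below the lower bound or above the upper bound."""
    lower, upper = sleeve
    return value < lower or value > upper


# Hypothetical measurements used to learn the sleeve.
history = [40.0, 50.0, 60.0, 50.0]
sleeve = std_sleeve(history, k=1.0)

print(outside_sleeve(50.0, sleeve))  # prints False
print(outside_sleeve(60.0, sleeve))  # prints True
```

A wider sleeve (larger k) produces fewer alerts at the cost of missing smaller deviations.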
The next step is to set thresholds by time of day. With this added capability I can set a reasonable threshold for the typical 10am and 2pm peak traffic periods separately, and alerts still come if there is unusual behavior at 8am. This quickly leads to “my Mondays are busier than most of my other days”. To avoid false alerts, we move to time of day and day of week, keeping the standard deviation for each of the 168 hours of the week.
The next problem is the end-of-quarter booking and shipping madness, or Black Friday (the largest shopping day of the year), and we realize we need to add a seasonally adjusted set of thresholds as well. And seasonality cannot take into account macro events, such as a weak economy affecting purchasing.
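One way to picture the 168-hour scheme is a separate sleeve per hour of the week. The sketch below assumes observations arrive as (datetime, value) pairs; the function names, the k=2.0 multiplier, and the sample data are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime
import statistics


def hour_of_week(ts):
    """Map a datetime to one of 168 buckets: Monday 00:00 is 0, Sunday 23:00 is 167."""
    return ts.weekday() * 24 + ts.hour


def build_baseline(observations, k=2.0):
    """Build {bucket: (lower, upper)} from an iterable of (datetime, value) pairs."""
    buckets = defaultdict(list)
    for ts, value in observations:
        buckets[hour_of_week(ts)].append(value)
    baseline = {}
    for bucket, values in buckets.items():
        if len(values) < 2:
            continue  # not enough history to estimate a spread
        mean = statistics.mean(values)
        std = statistics.pstdev(values)
        baseline[bucket] = (mean - k * std, mean + k * std)
    return baseline


def is_anomaly(ts, value, baseline):
    """Compare a value only against the sleeve for its own hour of the week."""
    bounds = baseline.get(hour_of_week(ts))
    if bounds is None:
        return False  # no baseline learned for this hour yet
    lower, upper = bounds
    return value < lower or value > upper


# Three hypothetical Mondays at 10am, a known busy hour.
history = [
    (datetime(2009, 3, 30, 10), 80.0),
    (datetime(2009, 4, 6, 10), 82.0),
    (datetime(2009, 4, 13, 10), 78.0),
]
baseline = build_baseline(history)
print(is_anomaly(datetime(2009, 4, 20, 10), 81.0, baseline))  # prints False
print(is_anomaly(datetime(2009, 4, 20, 10), 95.0, baseline))  # prints True
```

Because each hour is judged against its own history, a busy Monday morning does not trigger alerts that would be appropriate for a quiet Sunday night.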
Of course, none of these will take into account the spikes mentioned above. How to solve all these problems -- hmmm.
There is a completely different approach, for which HP Labs has a patent filing, that uses more sophisticated machine learning. Like the other approaches, it breaks time up into segments, which are called epochs in the paper “Achieving Scalable Automated Diagnosis of Distributed Systems Performance Problems”. Unlike the other approaches, it does not simply compare now to a predetermined set of threshold levels.
This method compares ‘now’ to recent epochs as well as previous learning and makes a determination of what behavior is good and bad. While a spike is bad, if the epoch is behaving well overall it is considered to have good behavior. There is a lot of math behind this, but I find that looking at a picture makes it much more obvious.

Notice the gray outline that looks like a city skyline. This is what the algorithm has determined is ‘normal’ or good behavior. The hatch lines on the right show where it found an anomaly.
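The idea of judging an epoch as a whole, rather than reacting to a single spike, can be illustrated with a toy sketch. This is not the HP Labs algorithm, which is not described here; the fixed bounds, the epoch length, the 25% tolerance, and all names are assumptions made purely for illustration.

```python
def split_into_epochs(samples, epoch_length):
    """Cut a list of samples into consecutive epochs of at most epoch_length samples."""
    return [samples[i:i + epoch_length] for i in range(0, len(samples), epoch_length)]


def epoch_is_anomalous(epoch, lower, upper, tolerance=0.25):
    """Flag an epoch only if more than `tolerance` of its samples fall outside the bounds."""
    violations = sum(1 for value in epoch if value < lower or value > upper)
    return violations / len(epoch) > tolerance


# Hypothetical metric: the first epoch has one spike, the second is mostly out of bounds.
samples = [50, 52, 99, 51, 49, 95, 97, 96, 50, 98]
epochs = split_into_epochs(samples, epoch_length=5)
flags = [epoch_is_anomalous(epoch, lower=40, upper=60) for epoch in epochs]
print(flags)  # prints [False, True]
```

The real method goes well beyond this: it learns what normal looks like from recent epochs and earlier history instead of relying on fixed bounds.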
These advanced algorithms are implemented in HP Problem Isolation. In the next post, I’ll discuss how analytics are used to find the source of the problem.
Posted 04-15-2009 6:01 PM by Michael_Procopio