In a post last year, I argued that to move from user experience monitoring to user experience management, you need to be able to work out the cause of a measured user experience problem, such as slow online check-in times. I also described a tool of ours called Problem Isolation that helps with exactly this.
Until now, Problem Isolation has used only performance data measured by our agentless probes (from a product called SiteScope) to correlate a top-line performance metric (like online check-in times) with the health of the services that metric depends on (database, app server, integration bus, etc). But there is another source of data we haven't included until now -- the events collected by our operations product, HP Operations Manager. If you have HP Operations Manager, you have a massive source of information that can also be used to determine where top-down performance problems lie.
This is how Problem Isolation now uses HP Operations Manager data:
- A business service problem is identified. For example, through synthetic or real-user monitoring we determine that online check-ins are running too slowly
- A “time buffer” around the problem start time is determined
- The model for the business service in the CMDB is traversed, returning a list of all services supporting the business service
- Events that occurred within the above-mentioned time-buffer relating to those supporting services are determined
- The services with the best-correlating events (taking into account severity as well) are identified as likely suspects
This algorithm applies to any event, whether it’s from a third-party enterprise management system (e.g. Tivoli), from HP Operations Manager, or from HP Network Node Manager.
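To make the steps above concrete, here is a minimal sketch in Python. It is not the Problem Isolation implementation: the `Event` class, the `SEVERITY_WEIGHT` table, the 15-minute default buffer, and the dictionary used as a stand-in for the CMDB model are all assumptions made for illustration, and a simple severity-weighted event count stands in for whatever correlation measure the product actually uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical severity weights; the product's real weighting is not described in this post.
SEVERITY_WEIGHT = {"critical": 4, "major": 3, "minor": 2, "warning": 1}


@dataclass
class Event:
    service: str
    severity: str
    timestamp: datetime


def supporting_services(model, business_service):
    """Walk the dependency model and return every service below business_service."""
    seen = set()
    stack = list(model.get(business_service, []))
    while stack:
        svc = stack.pop()
        if svc not in seen:
            seen.add(svc)
            stack.extend(model.get(svc, []))
    return seen


def rank_suspects(model, events, business_service, problem_start,
                  buffer=timedelta(minutes=15)):
    """Score supporting services by severity-weighted events inside the time buffer."""
    window_start = problem_start - buffer
    window_end = problem_start + buffer
    candidates = supporting_services(model, business_service)
    scores = {}
    for ev in events:
        if ev.service in candidates and window_start <= ev.timestamp <= window_end:
            weight = SEVERITY_WEIGHT.get(ev.severity, 0)
            scores[ev.service] = scores.get(ev.service, 0) + weight
    # Highest score first: these are the likely suspects.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy model: online check-in depends on an app server and an integration bus,
# and the app server in turn depends on a database.
model = {
    "online check-in": ["app server", "integration bus"],
    "app server": ["database"],
}
start = datetime(2009, 1, 8, 9, 0)
events = [
    Event("database", "critical", start - timedelta(minutes=5)),
    Event("app server", "minor", start + timedelta(minutes=2)),
    Event("database", "warning", start - timedelta(hours=3)),  # outside the buffer
    Event("mail server", "critical", start),  # not a supporting service
]
suspects = rank_suspects(model, events, "online check-in", start)
```

In this toy data the database comes out as the top suspect, because its critical event falls inside the buffer, while the mail server is ignored since it does not support online check-in. Because the function only looks at service name, severity and timestamp, the source of the event does not matter.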
-------------
In our quest to move from service health monitoring to service health management, we're trying to provide as much information relating to a problem/incident as possible - all in one place, and presented so that it is easy to visualize.
In BAC 8.0, you can see the following regarding a problem service, all from one place:
- The current performance of the service
- The performance of the service over time
- SLAs resting on the service and their closeness to jeopardy
- Business processes using that service and the impact of the problem on those business processes. If you have our Business Transaction Management modules of BAC, you can see exactly which business process instances are affected or at risk. In industries like financial services this matters because the value of transactions can vary hugely, and business operations wants to know which important business process instances are affected (e.g. a $10m inter-bank transfer) so that they can initiate manual workarounds
- Measured user experiences resting on this service. Imagine an app server is having a problem. You can "look upwards" and see that this app server is used to serve the online check-in business service. You can then see the measured impact of the app server problem on the online check-in user experience. This would be measured using either synthetic or real-user monitoring
- Real changes that have occurred under the problem service. The real changes are inferred by the discovery technology that notices deltas between the state of CIs today versus yesterday
- Planned changes against the problem service as taken from the change/release management system
- Outstanding incidents against the problem service. You can "look across" to the details of the incidents to see if they provide the app support team with any insight into how to solve the problem
- Non-compliance state of servers under the problem service. Our server automation technology now updates server compliance state into the CMDB and this can be viewed in this 360-degree view of the problem service
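The "real changes" item in the list above rests on comparing the state of CIs today with their state yesterday. A rough sketch of that kind of delta detection follows; the snapshot layout (a dictionary of CI name to attribute dictionary), the function name `detect_changes`, and the sample CI names and attributes are all hypothetical, and this is not how the discovery technology is actually implemented.

```python
def detect_changes(yesterday, today):
    """Compare two snapshots of CI attributes and list every difference.

    Each snapshot maps a CI name to a dict of its attributes. The result is a
    list of (ci, attribute, old_value, new_value) tuples; a value of None means
    the attribute (or the whole CI) was absent in that snapshot.
    """
    changes = []
    for ci in sorted(set(yesterday) | set(today)):
        old_attrs = yesterday.get(ci, {})
        new_attrs = today.get(ci, {})
        for attr in sorted(set(old_attrs) | set(new_attrs)):
            old, new = old_attrs.get(attr), new_attrs.get(attr)
            if old != new:
                changes.append((ci, attr, old, new))
    return changes


# Hypothetical snapshots: one CI had an attribute changed, another CI is new.
yesterday = {"appsrv01": {"jvm": "1.5.0_12", "heap_mb": 1024}}
today = {
    "appsrv01": {"jvm": "1.5.0_16", "heap_mb": 1024},
    "db02": {"version": "10g"},
}
changes = detect_changes(yesterday, today)
```

Changes found this way are "real" in the sense that they were observed, whether or not a matching planned change exists in the change/release management system.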
------------
Mike Shaw.
Posted 01-08-2009 9:51 AM by adsey007