Fighting or friendly, Problem Isolation and OMi

by Michael Procopio

In the post “Event Correlation OMi TBEC and Problem Isolation What's the Difference”, my fellow blogger Jon Haworth discussed the differences between TBEC and Problem Isolation. To be consistent, I'll use the acronyms PI for Problem Isolation and TBEC for OMi (Operations Manager i series) Topology Based Event Correlation.

Briefly, he mentioned that TBEC works “bottom up”, that is, starting from the infrastructure, with events. PI works “top down”, that is, starting from an end user experience problem, primarily with metric (time series) data.

Jon did an excellent job describing TBEC; I’ll do my best on PI because like Jon I have a conscience to settle.

Problem Isolation is a tool to:

1. automate the steps a troubleshooter would go through

2. run additional tests that might uncover the problem

3. look at all metric/performance data from the end user experience monitoring and all the infrastructure it depends on

4. find the infrastructure metric that most closely matches the end user problem using behavior learning and regression analysis techniques (developed by HP Labs)

5. bring additional data such as events, help/service desk tickets and changes to the troubleshooter

6. allow the troubleshooter to execute Run books to potentially solve the problem

Potentially the biggest difference in the underlying technology is that Problem Isolation does not require any correlation rules or thresholds to be set for it to do the regression analysis to point to the problem. Like TBEC, it does require that an application be modeled in a CMDB.
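
To make the idea of matching without thresholds concrete, here is a minimal sketch in Python. It is not the HP Labs technique, whose details are not covered in this post; it substitutes a plain Pearson correlation as a stand-in, and the metric names and sample values are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no information about the other
    return cov / sqrt(var_x * var_y)


def rank_metrics(eum_response, infra_metrics):
    """Rank infrastructure metrics by how closely they track the EUM series.

    No thresholds are involved: only the shape of each series matters.
    """
    scores = {name: abs(pearson(eum_response, series))
              for name, series in infra_metrics.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample data: end user response time (seconds) per interval.
eum_response = [1.1, 1.0, 1.2, 3.8, 4.1, 3.9, 1.1]
# Hypothetical infrastructure metrics sampled over the same intervals.
infra_metrics = {
    "web_server_cpu_pct": [35, 36, 34, 37, 35, 36, 35],
    "app_server_heap_mb": [510, 530, 500, 520, 540, 505, 515],
    "db_lock_waits": [2, 1, 2, 40, 45, 42, 2],
}

ranking = rank_metrics(eum_response, infra_metrics)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this made-up data the database lock metric rises and falls with the end user response time, so it ranks first even though no alert threshold was ever defined for it.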

An example: presume a situation with a typical composite application: web server, application server and database. No infrastructure thresholds were violated; therefore, there are no infrastructure alerts. Again, as mentioned in the previous post, end user monitoring (EUM) is the backstop. EUM alerts on slow end user performance. Now what?

Here is what Problem Isolation does:

1. determines which infrastructure elements (ITIL configuration items or CIs) support the transaction

2. reruns the test(s) that caused the alert – this validates it is not a transient problem

3. runs any additional tests defined for the CIs

4. collects Service Level Agreement information

5. collects all available infrastructure performance metrics (web server, application server, database server and operating systems for each) and compares them to the EUM data using behavior and regression analysis

[Screenshot: Problem Isolation screen showing the performance correlation between end user response and SQL Server database locks]

6. determines and displays the most probable suspect CI and alternates

7. displays run books available for all infrastructure CIs for the PI user to run directly from the tool

8. allows the PI user to attach all the information to a service ticket, either an existing one or a new one
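
Steps 1 and 6 can be sketched in a few lines of Python. This is a hypothetical illustration only: the dictionary stands in for a CMDB model, the CI names and scores are invented, and the scores are assumed to come from an earlier analysis step rather than from PI itself.

```python
from collections import deque

# Hypothetical stand-in for a CMDB model: each CI maps to the CIs it depends on.
DEPENDS_ON = {
    "checkout_transaction": ["web_server_1"],
    "web_server_1": ["app_server_1", "web_host_os"],
    "app_server_1": ["orders_db", "app_host_os"],
    "orders_db": ["db_host_os"],
    "reporting_db": ["db_host_os"],  # not used by the checkout transaction
}


def supporting_cis(model, start):
    """Return every CI reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        ci = queue.popleft()
        for dep in model.get(ci, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


def suspects(cis, scores, alternates=2):
    """Pick the highest-scoring CI as the suspect, plus a few alternates."""
    if not cis:
        raise ValueError("no supporting CIs to rank")
    ranked = sorted(cis, key=lambda ci: scores.get(ci, 0.0), reverse=True)
    return ranked[0], ranked[1:1 + alternates]


# Assumed per-CI scores (0..1) from an earlier correlation step.
scores = {
    "orders_db": 0.97,
    "app_server_1": 0.61,
    "web_server_1": 0.42,
    "reporting_db": 0.99,  # high score, but outside the transaction's model
}

cis = supporting_cis(DEPENDS_ON, "checkout_transaction")
primary, others = suspects(cis, scores)
print("most probable suspect:", primary)
print("alternates:", others)
```

Restricting the ranking to CIs reachable from the transaction is why the application model matters: a database that looks suspicious but does not support the transaction never becomes a suspect.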

Another key differentiator between OMi/TBEC and PI is the target user. There is such a wide variance in how organizations work that it is hard to name the role, but let me give a brief description and I think you will be able to determine the title in your organization.

There are some folks in the organization whose job is to take a quick look (typically < 10 minutes, in one organization I interviewed < 1 minute) at a situation and determine if they have explicit instructions on what to do via scripts or run books. When they have no instructions for a situation they pass it on to someone who has a bit more experience and does some free form triage.

This person might be able to fix the problem or may have to pass it on to a subject matter expert; for example, if they believe it is an MS Exchange problem, to an Exchange admin. It is this second person that Problem Isolation is targeted at. PI helps automate her job, performing in seconds what might otherwise take tens of minutes to hours. If it turns out she can’t solve the problem, PI automatically provides full documentation of all the information collected. That alone might take someone five minutes to write up.

OMi’s target is the operations bridge console user. Ops Bridge operators tend to be lower skilled and face hundreds if not thousands of events per hour. Jon described how OMi helps them work smarter.

TBEC and Problem Isolation both work to find the root cause of an incident but in different ways. Much like a doctor might use an MRI or CAT scan to diagnose a patient based on what the situation is, TBEC and Problem Isolation are complementary tools each with unique capabilities.

Problem Isolation will not find problems in redundant infrastructure, whereas OMi will. Conversely, OMi can’t help with EUM problems when no events are triggered, whereas Problem Isolation can.

We know this can be a confusing area. We welcome your questions to help us do a better job of describing the difference. But these two are definitely friendly.

For Business Availability Center, Michael Procopio

Related Items

  1. Advanced analytics reduces downtime costs - detection
  2. Advanced analytics reduces downtime costs – isolation
  3. Problem Isolation page
  4. Operations Manager i page

Posted 09-22-2009 8:27 PM by Michael_Procopio

Comments

jtylerblue wrote re: Fighting or friendly, Problem Isolation and OMi
on 09-25-2009 5:32 PM

Thanks for the blog and the screen shot.

PI however does not use any HP OM Coda or NNMi performance data as of yet. Nor does it use any event data from NA or SA.  Will this be available in a future release? What are the plans for PI in the future? What enhancements or new features will be coming and when could we expect them? In particular I'm looking for it to use Coda and NNMi performance data and event data from NA and/or SA. It would be obviously great to see that a configuration change caused performance degradation.

I was unaware of the Run books - how is this compared to Operations Orchestration? Are the "run books" you are referring to actually OO?

Finally, the Service Desk ticket information - does this work with 3rd party Service Desk's or only HP Service Manager?

Thanks,

Stephen

Michael_Procopio wrote re: Fighting or friendly, Problem Isolation and OMi
on 10-06-2009 3:04 AM

Thanks for your questions Stephen. I'm sorry it took so long for me to reply but I had some research to do.

We have engineers working on adding more data sources. When/if they come out we will be sure to announce it here (and other places).

Unfortunately because of new financial rules I can't disclose specific features, releases or dates. The general direction is more data sources and additional standalone and integrated use cases that cover additional domains.

The Run book feature is referring to Operations Orchestration. PI's value add is to make them available in the same tool. By doing this you don't need to run another tool and navigate around. The OO flows specific to those CIs are directly available to run and we can pass parameters the flows might need from the CMDB without additional user action.

The Service Desk question was what I needed to research. Yes, we can support 3rd party Service Desks provided they are integrated or federated with uCMDB and make the information available. For our own Service Manager, PI asks the CMDB for the information and sends information to it to update CIs and related objects.
