Event Correlation: OMi TBEC and Problem Isolation - What's the difference (part 2 of 3) - Infrastructure Management Software Blog -
Event Correlation: OMi TBEC and Problem Isolation - What's the difference (part 2 of 3)

If you have not done so already, you may want to start with part 1 in this series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-1-of-3.aspx

This is part 2 of 3 of my discussion of the event correlation technologies within OMi Topology Based Event Correlation (TBEC) and Problem Isolation. I'm going to focus on talking about how TBEC is used and how it helps IT Operations Management staff be more effective and efficient. My colleague Michael Procopio has discussed PI in more detail over in the BAC blog here: PI and OMi TBEC blog post 

If you think about an Operations Bridge (or "NOC"… but I've blogged my opinion of that term previously) then fundamentally its purpose is very simple.

 

The Ops Bridge is tasked with monitoring the IT Infrastructure (network, servers, applications, storage etc.) for events and resource exceptions which indicate a potential or actual threat to the delivery of the business services which rely on the IT infrastructure. The goal is to fix issues as quickly as possible in order to reduce the occurrence or duration of business service issues.

 

Event detection is an ongoing process 24x7 and the Ops Bridge will monitor the events during all production periods, often 24x7 using shift based teams.

 

Event monitoring is an inexact discipline. In many cases a single incident in the infrastructure will result in numerous events – only one of which actually relates to the cause of the incident, the other events are just symptoms.

 

The challenge for the Ops Bridge staff is to determine which events they need to investigate and to avoid chasing the symptom events. The operations team must prioritize their activities so that they invest their finite resources in dealing with causal events based on their potential business impact, and avoid wasting time in duplication of effort (chasing symptoms) or, even worse, in chasing symptoms down in a serial fashion before they finally investigate the actual causal event, as this will extend the potential for extended downtime of business services.

 

TBEC helps the Operations Bridge in addressing these challenges. TBEC works 24x7, examining the event stream, relating it to the monitored infrastructure and the automatically discovered dependencies between the monitored components. TBEC works to provide a clear indication that specific events are related to each other (related to a single incident) and to identify which event is the causal event and which are symptoms.

 

Consider a disk free space issue on a SAN, which is hosting an oracle database. With comprehensive event monitoring in place, this will result in three events:

  • a disk space resource utilization alert
  • quickly be followed by an Oracle database application error
  • and a further event which indicates that a Websphere server which uses the Oracle database is unhappy

 

Separately, all three events seem ‘important’ – so considerable time could be wasted in duplicate effort as the Ops Bridge tries to investigate all three events. Even worse, with limited resources, it is quite possible that the Operations staff will chase the events ‘top down’ (serially) – look at Websphere first, then Oracle, and finally the SAN – this extends the time to rectification and increases the duration (or potential) of a business outage.

 

TBEC will clearly show that the event indicating the disk space issue on the SAN is the causal event – and the other two events are symptoms.

 

In a perfect world the Ops Bridge can monitor everything, detect every possible event or compromised resource that might impact a business service and fix everything before a business service impact occurs.

 

The introduction of increasingly redundant and flexible infrastructure helps with this – redundant networks, clustered servers, RAID disk arrays, load balanced web servers etc. But, it also can add complications which I’ll illustrate later.

 

One of the challenges of event monitoring is that it simply can NOT detect everything that can impact business service delivery. For example, think about a complex business transaction, which traverses many components in the IT infrastructure. Monitoring of each of the components involved may indicate that they are heavily utilized – but not loaded to the point where an alert is generated.

 

However, the composite effect on the end to end response time of the business transaction may be such that response time is simply unacceptable. For a web based ordering system where customers connect to a company’s infrastructure and place orders for products this can mean the difference between getting orders or the customer heading over to a competitors web site.

 

This is why End User Monitoring technologies are important. I'll talk about EUM in the next, and final, edition of this blog serial.

Read part 3 in the series.
http://www.communities.hp.com/online/blogs/managementsoftware/archive/2009/09/25/event-correlation-omi-tbec-and-problem-isolation-what-s-the-difference-part-3-of-3.aspx

For HP Operations Center,  Jon Haworth.


Posted 09-25-2009 3:49 PM by jonhaworth
Powered by Community Server (Non-Commercial Edition), by Telligent Systems