Visit Heroix at http://www.heroix.com
Charting Life in the IT Environment

>> Correlating Events to Recognize Problems

by Chris Smith on November 15, 2009

Every engineer and manager who receives alerts from automated monitoring systems can relate to both the critical need they fill and their often annoying shortcomings. I’m not just referring to a situation where the monitoring has recently been installed and you haven’t yet tuned the default thresholds to limit notification to actionable events. The nature of monitoring and alerting, even with the most sophisticated programs, is that problem events are based on very narrow criteria, like a server’s CPU load, a router’s bandwidth consumption, or an application’s .NET errors. It’s good to know about these specific problems. But if they are related parts of one complex problem, that diagnosis can easily be missed when the separate events are buried among hundreds or thousands of unrelated problem events. Recognizing the complex problem is further complicated because the events come from different sources: servers, network appliances, and applications. Most real-world applications depend on distributed environments for reliable service delivery, so problems occur whose symptoms span multiple devices and programs. If you want to get ahead of the curve and immediately recognize and fix complex problems, you need to start correlating multiple events so that you can send intelligent notifications that describe the conditions and fixes for those problems.

What’s The Problem?

Events can be misleading. Consider an example where several servers are behind a switch, and we are monitoring the availability of both the switch and the servers. When the switch goes down, what happens? A flood of notifications goes out alerting everyone that all the servers are down, which is effectively true, but isn’t really the problem. Eventually the switch down alert arrives, mixed in with all the server down messages. This is a simple example, and most good engineers will immediately diagnose the problem when they read the switch down alert, but a lot of messages were sent before you were notified of the true problem. I always cringe when I know my boss is getting flooded with email saying that the sky is falling. Now, what if we use some logic in our notification that only sends out server down messages when the switch is OK, and suppresses all the server down messages when the switch goes down? That would be useful. Even better, let’s configure the switch down message to give recipients the list of servers that are unavailable because the switch is down.
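The suppression logic in the switch example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TOPOLOGY` map, the device names, and the `build_notifications` function are all hypothetical, and a real monitoring product would supply the list of down devices itself.

```python
from typing import Dict, List, Set

# Hypothetical topology: which servers sit behind which switch.
TOPOLOGY: Dict[str, List[str]] = {
    "switch-1": ["web-1", "web-2", "db-1"],
}


def build_notifications(down: Set[str], topology: Dict[str, List[str]]) -> List[str]:
    """Return one message per root problem instead of one per down device."""
    messages: List[str] = []
    suppressed: Set[str] = set()

    for switch, servers in topology.items():
        if switch in down:
            # The switch is the root cause: report it once, list the servers
            # behind it, and suppress their individual "down" alerts.
            suppressed.update(servers)
            messages.append(
                f"{switch} is DOWN; unreachable servers: {', '.join(servers)}"
            )

    switches = set(topology)
    for device in sorted(down - suppressed - switches):
        # A server is down while its switch is OK: a genuine server problem.
        messages.append(f"{device} is DOWN")

    return messages
```

With the switch and all three servers reported down, the function returns a single message naming the switch and the servers behind it. With only `web-2` down, it returns the ordinary server down message.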

The switch example is easy to understand. Think about how useful correlating much more complex events can be, especially when critical information is included in the notification. Fixing a problem is a lot easier if the problem email or text message includes a concise description of the multiple conditions, what the root cause is, how to fix it, and whom to contact for help. I’ve even worked with customers to include links to their own online SOP or Help Desk documentation. In my experience with problem email alerts, fewer is always better, as long as you still get every notification you need. Correlation of events is the only way to simultaneously reduce the email count and dramatically improve the quality of information in alerts.

Logically Speaking

A Correlated Event is going to have multiple conditions. Sometimes all conditions must be true. We also want to be able to recognize when some conditions are true, while others are definitely not true. We may even want to specify that some conditions must be true, while others may be true, and some others must not be true (a really complex event…). We’re really only using three Boolean operators, AND, OR, and NOT, and grouping the logically similar conditions under each. The logical order of listing the conditions should be:

  1. Conditions that Must Be True
  2. Conditions that May Be True
  3. Conditions that Must Not Be True

The number of conditions can vary, but most complex problems can be recognized based on between 2 and 10 conditions. More may be needed when deducing problems along an extensive network path or across multiple application servers, for example.
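The three groups above map directly onto AND, OR, and NOT. Here is a minimal sketch, assuming that conditions are identified by simple string names and that "may be true" means at least one condition in that group is active (an empty group places no constraint). The `CorrelatedEvent` class and the condition names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class CorrelatedEvent:
    name: str
    must_be_true: FrozenSet[str] = frozenset()      # AND: every one is active
    may_be_true: FrozenSet[str] = frozenset()       # OR: at least one is active
    must_not_be_true: FrozenSet[str] = frozenset()  # NOT: none is active

    def matches(self, active: Set[str]) -> bool:
        all_required = self.must_be_true <= active
        # An empty "may" group places no constraint on the match.
        any_optional = not self.may_be_true or bool(self.may_be_true & active)
        none_forbidden = not (self.must_not_be_true & active)
        return all_required and any_optional and none_forbidden


# Hypothetical example: a slow application caused by its database host,
# but only when the network itself is healthy.
slow_app = CorrelatedEvent(
    name="app slow due to database host",
    must_be_true=frozenset({"app_response_slow", "db_cpu_high"}),
    may_be_true=frozenset({"db_disk_queue_high", "db_memory_low"}),
    must_not_be_true=frozenset({"switch_down"}),
)
```

Calling `slow_app.matches(...)` with the set of currently active condition names returns `True` only when both required conditions are present, at least one optional condition is present, and `switch_down` is absent.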

I find it useful to include some synchronization and an awareness of persistence when correlating events. First, not all conditions are based on measurements taken at the same interval. For example, disk statistics are typically collected hourly, whereas availability is tested every 1 to 5 minutes. Many other conditions have intervals between these two extremes. So I may want to specify that all conditions must occur within a specified interval. It’s also really valuable to be able to specify that X number of events must have happened in that interval. Picture network latency that gets flaky sometimes: an occasional spike can be ignored, but if it persists for X amount of time then it’s a problem.
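The interval and persistence checks can be expressed as a count of events inside a time window. The sketch below is an assumption about one simple way to do it; the `persists` function, its parameters, and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def persists(
    timestamps: Iterable[datetime],
    now: datetime,
    window: timedelta,
    min_count: int,
) -> bool:
    """True if at least min_count events fall in the window ending at now."""
    start = now - window
    recent: List[datetime] = [t for t in timestamps if start <= t <= now]
    return len(recent) >= min_count


# Hypothetical latency events seen 1, 4, 9, and 40 minutes ago.
now = datetime(2009, 11, 15, 12, 0)
latency_events = [now - timedelta(minutes=m) for m in (1, 4, 9, 40)]

# Three events fall inside the last 10 minutes, so a threshold of 3 is met
# and a threshold of 4 is not.
is_problem = persists(latency_events, now, timedelta(minutes=10), min_count=3)
is_worse = persists(latency_events, now, timedelta(minutes=10), min_count=4)
```

When conditions are collected at different intervals, the window generally needs to be at least as long as the slowest collection interval involved, otherwise an hourly measurement could never line up with a one-minute availability test.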

Intelligent Notification

Once we recognize a complex problem based on the presence or absence of specific conditions, we’re in a position to provide effective notification that maximizes the probability that the problem is fixed as quickly as possible. You’re setting yourself and those around you up for success. Here’s how I recommend configuring Correlated Event notification:

  1. Describe the condition set
    1. What it means
    2. What the root cause is
  2. Describe the procedure to fix the problem
    1. Links to documentation
  3. List the interested parties to contact for help
    1. ISP Contacts
    2. Network Admins
    3. Server Admins
    4. Application Support
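The outline above can be turned into a small message template. This is a sketch only: the `Notification` class, its field names, the example text, and the `example.com` link are hypothetical placeholders, not part of any product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Notification:
    meaning: str
    root_cause: str
    fix_steps: List[str]
    doc_links: List[str]
    contacts: List[str]

    def render(self) -> str:
        """Build the plain-text body in the order recommended above."""
        lines = [
            f"What it means: {self.meaning}",
            f"Root cause: {self.root_cause}",
            "How to fix:",
        ]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.fix_steps, 1)]
        lines.append("Documentation:")
        lines += [f"  {link}" for link in self.doc_links]
        lines.append("Contacts:")
        lines += [f"  {contact}" for contact in self.contacts]
        return "\n".join(lines)


# Hypothetical alert for the switch example.
alert = Notification(
    meaning="All servers behind switch-1 are unreachable",
    root_cause="switch-1 is down",
    fix_steps=["Check power and uplink on switch-1", "Fail over to the spare switch"],
    doc_links=["https://example.com/sop/switch-outage"],
    contacts=["Network Admins", "Server Admins"],
)
body = alert.render()
```

The resulting `body` string can be handed to whatever email or text message mechanism your monitoring already uses.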

How To Do It

You can build a BAT file or a VB, shell, or Perl script that applies a CASE test using the Boolean operators described above, but you’ll have to build an interface to the database of events. You can even use a well-crafted query to select for the conditions of interest. If you use Longitude, then you can just use the Correlated Event actions to define the multiple conditions, interval, and persistence, and then build your notification. Please email me if you have questions about using Correlated Events.
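As a rough illustration of the query approach, here is a sketch that uses Python with an in-memory SQLite database as a stand-in for the event store. The `events` table, its columns, the condition names, and the sample rows are all assumptions made for the example; a real event database will have its own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (host TEXT, cond TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("db-1", "cpu_high", 100),
        ("db-1", "disk_queue_high", 130),
        ("db-2", "cpu_high", 110),
        ("db-3", "cpu_high", 120),
        ("db-3", "disk_queue_high", 125),
        ("db-3", "backup_running", 128),
    ],
)

# In SQLite a comparison evaluates to 1 or 0, so SUM(...) counts the
# matching rows per host inside the time window.
QUERY = """
SELECT host
FROM events
WHERE ts BETWEEN :t0 AND :t1
GROUP BY host
HAVING SUM(cond = 'cpu_high') > 0          -- must be true
   AND SUM(cond = 'disk_queue_high') > 0   -- must be true
   AND SUM(cond = 'backup_running') = 0    -- must not be true
ORDER BY host
"""

hosts = [row[0] for row in conn.execute(QUERY, {"t0": 0, "t1": 300})]
```

With the sample rows, only `db-1` satisfies all three conditions: `db-2` has no disk queue event, and `db-3` is excluded because a backup was running.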

2 Comments »

  1. Comment by robboulter
    November 30, 2009 @ 7:37 am

    Hi, when we bought SMARTS we were told (sold) that SMARTS AM (Availability Manager) would do exactly this. However, in the three years of supporting it I never saw this correlation performed. Does anybody know whether it actually works?

    Rgds Rob.

  2. Comment by Frank
    March 17, 2010 @ 11:22 am

    Hi Rob,

    Seems someone misunderstood the concept behind SMARTS and sold it as the universal “root cause” solution. It does impact/root cause analysis for infrastructure very well, but it isn’t the universal weapon.

    reg, Frank

© 2010 Heroix | Heroix | RSS | Privacy Policy | Email: info@heroix.com