Visit Heroix at http://www.heroix.com
Charting Life in the IT Environment

Lose the False Alarms: 5 Tips for Better Performance Monitoring

by admin on June 9, 2008

“We started getting so many alerts we couldn’t tell what to pay attention to.”

Sound familiar?

Unfortunately, any IT monitoring effort can come with a snag: factory settings that are too high, too low, or just not applicable to your workload. Whether you are using commercial software or working with an open source or home-grown monitoring solution, over-notification can actually make you less productive and allow real, sometimes serious, problems to fall through the cracks.

If you are considering implementing a monitoring solution, or looking to improve what you are already doing, here are five common pitfalls and how you can avoid them.

1. Watch out for “one size fits all” thresholds.
Different workloads require different performance thresholds, and unless your monitoring software is tailored to your environment, you will end up with false alarms for applications where high utilization is the norm.

Save yourself headaches by addressing this the first time you receive an overeager alert. In Longitude, you can change settings right from the event monitor dashboard, as soon as you see the problem.


Screen shot showing threshold adjustment

Furthermore, Longitude helps you determine appropriate thresholds by calculating minimum, maximum, and average workload values for any threshold you may need to adjust. This saves you time and takes the guesswork out of configuring Longitude. You can even view workload values and change thresholds globally or for a subset of servers, all in one step. Configuring a few, hundreds, or even thousands of servers is quick and simple.

Screen shot showing minimum, maximum, and average workload values

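The idea of deriving a threshold from observed workload values can be sketched in a few lines of Python. This is only an illustration of the general technique, not Longitude's implementation: the function names, the sample data, and the `headroom` parameter are all assumptions made up for the example.

```python
from statistics import mean


def summarize(samples):
    """Return (minimum, maximum, average) for a list of utilization samples."""
    return min(samples), max(samples), mean(samples)


def suggest_threshold(samples, headroom=0.10, ceiling=100.0):
    """Suggest an alert threshold slightly above the observed maximum.

    headroom is a hypothetical tuning knob: 0.10 means 10% above the peak.
    The result is capped at ceiling, since utilization cannot exceed 100%.
    """
    _, observed_max, _ = summarize(samples)
    return min(ceiling, round(observed_max * (1 + headroom), 1))


# Hypothetical CPU utilization samples (percent) for two different workloads.
workloads = {
    "batch-server": [82, 91, 88, 79, 85],
    "file-server": [12, 20, 15, 9, 18],
}

# One threshold per workload instead of a single "one size fits all" value.
thresholds = {name: suggest_threshold(s) for name, s in workloads.items()}
print(thresholds)
```

A busy batch server and a quiet file server end up with very different thresholds, which is the point: a single factory value would either flood you with alerts from the first or miss trouble on the second.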

2. Filter out “non-problems.”
Just as there may be threshold values specific to your environment, there may also be individual components or even whole classes of problems that you do not want reported. For example, there may be specific Windows services, Unix/Linux file systems, or network interfaces that are not considered mission critical. Longitude allows you to specify filters based on component names as well as performance characteristics, so you can skip data collection for those you do not wish to monitor.

Screen shot showing data collection filter

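If you are building or tuning your own monitoring, a filter like this can be as simple as a list of name patterns plus a check on one performance characteristic. The sketch below is a generic Python illustration, not Longitude's filter syntax; the pattern list, the `min_size_gb` cutoff, and the file system names are assumptions chosen for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion rules: name patterns for components that are not
# considered mission critical in this environment.
EXCLUDED_NAME_PATTERNS = ["/mnt/scratch*", "/tmp"]


def should_collect(name, size_gb=None, min_size_gb=1.0):
    """Return True if data should be collected for this component.

    A component is skipped when its name matches an excluded pattern, or
    when its size (a performance characteristic) is below min_size_gb.
    """
    if any(fnmatchcase(name, pattern) for pattern in EXCLUDED_NAME_PATTERNS):
        return False
    if size_gb is not None and size_gb < min_size_gb:
        return False
    return True


# Hypothetical file systems as (mount point, size in GB).
file_systems = [("/", 50.0), ("/mnt/scratch1", 200.0), ("/boot/efi", 0.5), ("/var", 20.0)]

monitored = [name for name, size in file_systems if should_collect(name, size)]
print(monitored)
```

Filtering before collection, rather than after an alert fires, keeps the unwanted components out of both your alerts and your stored data.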

3. Avoid repetitive notification for persistent problems.
Some problems take time to correct. When you or someone else on the staff will be working on an issue for a period of time, repeated reminders are not only unnecessary but also annoying and distracting.

Longitude allows you to suppress notification (again, right from the event monitor) to allow for repair time. If for any reason you decide that an event is not applicable to your environment, you can disable it entirely. Should the situation change, you can simply re-enable the event.

Screen shot showing event shutoff

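For a home-grown solution, the same two controls (a temporary suppression window and an on/off switch per event) can be modeled with very little state. The following Python sketch is an assumption-laden illustration: the `NotificationGate` class, its method names, and the event ID are hypothetical and do not reflect Longitude's interface.

```python
from datetime import datetime, timedelta


class NotificationGate:
    """Decide whether an event may notify, given suppressions and disables."""

    def __init__(self):
        self._suppressed_until = {}  # event id -> time when suppression ends
        self._disabled = set()       # event ids switched off entirely

    def suppress(self, event_id, now, duration):
        """Silence an event for a repair window starting at now."""
        self._suppressed_until[event_id] = now + duration

    def disable(self, event_id):
        """Turn an event off because it does not apply to this environment."""
        self._disabled.add(event_id)

    def enable(self, event_id):
        """Turn a previously disabled event back on."""
        self._disabled.discard(event_id)

    def should_notify(self, event_id, now):
        if event_id in self._disabled:
            return False
        until = self._suppressed_until.get(event_id)
        return until is None or now >= until


# Hypothetical usage: allow two hours to repair a full disk on db01.
gate = NotificationGate()
start = datetime(2008, 6, 9, 9, 0)
gate.suppress("disk-full:db01", start, timedelta(hours=2))

during_repair = gate.should_notify("disk-full:db01", start + timedelta(hours=1))
after_repair = gate.should_notify("disk-full:db01", start + timedelta(hours=3))
print(during_repair, after_repair)
```

Because the suppression expires on its own, a problem that is still present after the repair window starts notifying again without anyone having to remember to switch it back on.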

4. Don’t be fooled by multi-symptom problems.
It’s not uncommon for a single problem to exhibit multiple symptoms. For example, if a router is down, it may “look” like all the systems it serves are down, resulting in multiple alerts that are in reality all attributable to the same root cause. Better visibility into underlying causes eliminates event clutter and speeds time-to-resolution.

Using correlated events, Longitude can determine the root cause of a problem and avoid duplicate notifications. In the case of the router outage, Longitude can correlate the state of the individual servers with the state of the router and, if the router malfunctions, send just one notification while suppressing the individual server notifications.

Screen shot showing correlated event

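The router example boils down to a dependency check: before notifying about a server, ask whether something upstream of it is already down. Here is a minimal Python sketch of that idea, assuming you maintain a simple map of which device each component sits behind. The function name, the topology, and the host names are hypothetical, and real correlation engines are considerably more involved.

```python
def notifications_for(down, depends_on):
    """Return the components to notify about, given what is currently down.

    down: set of component names reported as unreachable.
    depends_on: mapping of component -> the upstream device it sits behind.
    A component is suppressed when any device upstream of it is also down.
    """
    def has_down_ancestor(name):
        seen = set()  # guards against accidental cycles in the mapping
        parent = depends_on.get(name)
        while parent is not None and parent not in seen:
            if parent in down:
                return True
            seen.add(parent)
            parent = depends_on.get(parent)
        return False

    return sorted(c for c in down if not has_down_ancestor(c))


# Hypothetical topology: two web servers behind router-a, a database behind router-b.
topology = {"web01": "router-a", "web02": "router-a", "db01": "router-b"}

router_outage = notifications_for({"router-a", "web01", "web02"}, topology)
single_server = notifications_for({"web01"}, topology)
print(router_outage, single_server)
```

With the router down, three raw alerts collapse into one notification about the router. When only a server is down, it is reported as usual.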

5. Remember: Some problems are time-based.
Depending on when the symptom occurs, an issue may or may not require attention. For example, if your virus scan runs at 1 AM and causes a two-hour spike in CPU usage, you would not want to be notified about high CPU during that period. Or, if different personnel are responsible at different times of day, it makes sense to notify only the staff on duty at any given time. Longitude accomplishes this by allowing you to schedule notifications for different events. You can also have non-problems eliminated from the event database altogether during specified periods such as system maintenance windows.

Screen shot showing notification schedule

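Both cases, quiet periods for expected spikes and routing to whoever is on duty, come down to checking the time of day against a set of windows. The Python sketch below illustrates the approach under stated assumptions: the event names, the 1 AM to 3 AM quiet window, and the shift boundaries are invented for the example and are not Longitude configuration.

```python
from datetime import time


def in_window(t, start, end):
    """True if t falls in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


# Hypothetical quiet period: the virus scan runs from 1 AM to 3 AM.
QUIET_WINDOWS = {"cpu-high": [(time(1, 0), time(3, 0))]}

# Hypothetical on-duty schedule as (start, end, who to notify).
ON_CALL = [
    (time(8, 0), time(20, 0), "day-shift"),
    (time(20, 0), time(8, 0), "night-shift"),
]


def recipients(event, t):
    """Return who should be notified about event at time of day t."""
    if any(in_window(t, s, e) for s, e in QUIET_WINDOWS.get(event, [])):
        return []  # expected behavior during this window, notify nobody
    return [who for s, e, who in ON_CALL if in_window(t, s, e)]


print(recipients("cpu-high", time(2, 0)))   # during the virus scan
print(recipients("cpu-high", time(12, 0)))  # midday
print(recipients("disk-full", time(2, 0)))  # a different event at 2 AM
```

Note that the quiet window applies only to the event it was defined for: a full disk at 2 AM still reaches the night shift, while the expected CPU spike does not.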

Solution or Shelfware?
Automated performance monitoring holds great potential for any IT organization striving to maintain high levels of service for its critical business applications, but experience shows that “factory” settings, even those based on industry best practices, can lead to over-alerting that is annoying, distracting, and counter-productive. Many overwhelmed IT organizations ignore or even decommission monitoring software because it is just too difficult to tune to their unique environment.

As the above examples show, properly tailored monitoring software can filter out false alarms and alert staff to true problems before they affect business processes. This saves the organization time and money and allows IT to focus on strategic objectives rather than on constantly finding and fixing problems after they’ve occurred.

© 2010 Heroix | Email: info@heroix.com