Visit Heroix at http://www.heroix.com
Charting Life in the IT Environment

>> Lose the False Alarms: 5 Tips for Better Performance Monitoring

June 9, 2008

“We started getting so many alerts we couldn’t tell what to pay attention to.”

Sound familiar?

Unfortunately, any IT monitoring effort can come with a snag: factory settings that are too high, too low, or just not applicable to your workload. Whether you are using commercial software or working with an open source or home-grown monitoring solution, over-notification can actually make you less productive and allow real, sometimes serious, problems to fall through the cracks.

If you are considering implementing a monitoring solution – or looking to improve what you are already doing – here are five common pitfalls and how you can avoid them.

1. Watch out for “one size fits all” thresholds.
Different workloads require different performance thresholds, and unless your monitoring software is tailored to your environment, you will end up with false alarms for applications where high utilization is the norm.

Save yourself headaches by addressing this the first time you receive what might be considered an “over-eager” alert. In Longitude, you can change settings right from the event monitor dashboard, as soon as you see the problem.


Screen shot showing threshold adjustment

Furthermore, Longitude helps you determine appropriate thresholds by calculating minimum, maximum, and average workload values for any threshold you may need to adjust. This saves you time and takes the guesswork out of configuring Longitude. You can even view workload values and change thresholds globally or for a subset of servers, all in one step. Whether you are configuring a few servers, hundreds, or even thousands, the process is quick and simple.

Screen shot showing minimum, maximum and average workload values
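As a rough illustration of how observed workload values can inform a threshold, the sketch below summarizes a series of utilization samples and proposes a threshold a few points above the observed maximum, capped at 100 percent. The function names, the margin rule, and the sample data are assumptions made for this example; they do not describe how Longitude itself calculates or stores thresholds.

```python
from statistics import mean


def summarize(samples):
    """Return the minimum, maximum, and average of observed workload values."""
    if not samples:
        raise ValueError("at least one sample is required")
    return {"min": min(samples), "max": max(samples), "avg": mean(samples)}


def suggest_threshold(samples, margin=5, ceiling=100.0):
    """Propose an alert threshold just above the normal workload.

    Hypothetical rule: observed maximum plus a margin (in percentage
    points), never higher than the ceiling.
    """
    stats = summarize(samples)
    return min(ceiling, stats["max"] + margin)


# CPU utilization (%) on a server where high utilization is the norm.
samples = [82, 88, 91, 85, 90]
threshold = suggest_threshold(samples, margin=5)

print(threshold)                                  # 96
print(sum(1 for s in samples if s > 80))          # 5 alerts with a generic 80% threshold
print(sum(1 for s in samples if s > threshold))   # 0 alerts with the tailored threshold
```

With a generic 80 percent threshold every one of these samples would raise an alert; with a threshold derived from the server's own history, none would.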

2. Filter out “non-problems.”
Just as there may be threshold values specific to your environment, there may also be individual components or even whole classes of problems that you do not want reported. For example, there may be specific Windows services, Unix/Linux file systems, or network interfaces that are not considered mission critical. Longitude allows you to specify filters based on component names as well as performance characteristics, so you can skip data collection for those you do not wish to monitor.

Screen shot showing data collection filter
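A filter of this kind can be pictured as a small predicate applied before data collection. The following is a minimal sketch, assuming components are described by a name and a capacity; the field names, the wildcard patterns, and the capacity cutoff are hypothetical and are not Longitude's filter syntax.

```python
from fnmatch import fnmatchcase


def should_collect(component, exclude_patterns, min_capacity_mb=0):
    """Return True if data should be collected for this component.

    A component is skipped when its name matches an exclusion pattern
    (name-based filter) or when it is smaller than the cutoff
    (a simple performance-characteristic filter).
    """
    name = component["name"]
    if any(fnmatchcase(name, pattern) for pattern in exclude_patterns):
        return False
    return component.get("capacity_mb", min_capacity_mb) >= min_capacity_mb


file_systems = [
    {"name": "/", "capacity_mb": 50000},
    {"name": "/mnt/cdrom0", "capacity_mb": 700},
    {"name": "/boot", "capacity_mb": 100},
]

monitored = [
    fs["name"]
    for fs in file_systems
    if should_collect(fs, exclude_patterns=["/mnt/cdrom*"], min_capacity_mb=500)
]
print(monitored)  # ['/']
```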

3. Avoid repetitive notification for persistent problems.
Some problems take time to correct. When you or someone else on the staff will be working on an issue for a period of time, repeated reminders are not only unnecessary, but annoying and distracting.

Longitude allows you to suppress notification, again right from the event monitor, to allow for repair time. If for any reason you decide that an event is not applicable to your environment, you can disable it entirely, and should the situation change, you can simply re-enable it.

Screen shot showing event shutoff
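The difference between suppressing an event for a repair window and disabling it outright can be sketched in a few lines. The class and method names below are invented for illustration and are not a Longitude API; the sketch only assumes that each event has an identifier and that the caller supplies the current time.

```python
from datetime import datetime, timedelta


class NotificationGate:
    """Decides whether an event should notify, honoring suppression and disabling."""

    def __init__(self):
        self._suppressed_until = {}  # event id -> time when notifications resume
        self._disabled = set()       # event ids switched off entirely

    def suppress(self, event_id, now, repair_time):
        """Silence an event for a limited repair window."""
        self._suppressed_until[event_id] = now + repair_time

    def disable(self, event_id):
        """Switch an event off until it is explicitly re-enabled."""
        self._disabled.add(event_id)

    def enable(self, event_id):
        self._disabled.discard(event_id)

    def should_notify(self, event_id, now):
        if event_id in self._disabled:
            return False
        resume_at = self._suppressed_until.get(event_id)
        return resume_at is None or now >= resume_at


gate = NotificationGate()
start = datetime(2008, 6, 9, 10, 0)
gate.suppress("disk-full:/var", start, timedelta(hours=2))

print(gate.should_notify("disk-full:/var", start + timedelta(minutes=30)))  # False
print(gate.should_notify("disk-full:/var", start + timedelta(hours=3)))     # True
```

Suppression expires on its own once the repair window has passed, while a disabled event stays silent until someone re-enables it.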

4. Don’t be fooled by multi-symptom problems.
It’s not uncommon for a single problem to exhibit multiple symptoms. For example, if a router is down, it may “look” like all the systems it serves are down, resulting in multiple alerts that are in reality all attributable to the same root cause. Better visibility into underlying causes eliminates event clutter and speeds time-to-resolution.

Using correlated events, Longitude can determine the root cause of a problem and avoid duplicate notifications. In the case of the router outage, Longitude can recognize the situation by correlating the state of individual servers with the state of the router and, if the router malfunctions, send just one notification while suppressing the individual server notifications.

Screen shot showing correlated event
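The router example can be reduced to a simple rule: a device that appears down is treated as a symptom when the device it depends on is also down. The sketch below applies that rule to a hypothetical dependency map; the device names and the `upstream` mapping are assumptions for illustration, and real correlation engines, including Longitude's, may work quite differently.

```python
def find_root_cause(device, down, upstream):
    """Follow the dependency chain upward while the parent is also down."""
    seen = {device}
    parent = upstream.get(device)
    while parent in down and parent not in seen:  # 'seen' guards against cycles
        seen.add(parent)
        device = parent
        parent = upstream.get(device)
    return device


def correlate(down, upstream):
    """Group down devices by root cause.

    Returns a dict mapping each root cause to the list of devices whose
    individual notifications can be suppressed.
    """
    groups = {}
    for device in sorted(down):
        root = find_root_cause(device, down, upstream)
        groups.setdefault(root, [])
        if device != root:
            groups[root].append(device)
    return groups


# Which device each server depends on for connectivity (hypothetical topology).
upstream = {"web-1": "router-1", "web-2": "router-1", "db-9": "router-2"}
down = {"router-1", "web-1", "web-2", "db-9"}

for root, symptoms in correlate(down, upstream).items():
    print(root, "suppresses", symptoms)
# db-9 suppresses []
# router-1 suppresses ['web-1', 'web-2']
```

Four devices look down, but only two notifications go out: one for the router and one for the database server whose own router is still up.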

5. Remember: Some problems are time-based.
Depending on when the symptom occurs, an issue may or may not require attention. For example, if your virus scan runs at 1 AM and causes a two-hour spike in CPU usage, you would not want to be notified during that period. Or, if you need to notify different personnel at different times of day, it makes sense to notify only those staff on duty at any given time. Longitude accomplishes this by allowing you to schedule notifications for different events. You can also have non-problems eliminated from the event database altogether during specified periods such as system maintenance windows.

Screen shot showing notification schedule
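Both ideas, quiet periods and on-duty routing, come down to checking the time of an event against a set of windows. The sketch below is one way to express that, assuming a 1 AM to 3 AM virus scan window and two made-up shifts; the team names, shift hours, and function names are hypothetical and do not reflect Longitude's scheduling configuration.

```python
from datetime import time


def in_window(t, start, end):
    """True if time t falls in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def recipients_for(event_time, quiet_windows, shifts):
    """Return who should be notified for an event at the given time of day."""
    if any(in_window(event_time, start, end) for start, end in quiet_windows):
        return []  # expected load, e.g. the nightly virus scan
    return [
        team
        for team, (start, end) in shifts.items()
        if in_window(event_time, start, end)
    ]


quiet_windows = [(time(1, 0), time(3, 0))]
shifts = {
    "day-team": (time(8, 0), time(20, 0)),
    "night-team": (time(20, 0), time(8, 0)),
}

print(recipients_for(time(2, 0), quiet_windows, shifts))    # []
print(recipients_for(time(9, 30), quiet_windows, shifts))   # ['day-team']
print(recipients_for(time(23, 15), quiet_windows, shifts))  # ['night-team']
```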

Solution or Shelfware?
Automated performance monitoring holds great potential for any IT organization striving to maintain high levels of service for its critical business applications, but experience shows that “factory” settings – even those based on industry best practices – can lead to over-alerting that is annoying, distracting, and counter-productive. Many overwhelmed IT organizations ignore or even decommission monitoring software because it is just too difficult to tune to their unique environment.

As the above examples show, properly tailored monitoring software can filter out false alarms and alert staff to true problems before they affect business processes. This saves the organization time and money and allows IT to focus on strategic organizational objectives rather than on constantly finding and fixing problems after they’ve occurred.

>> End User Experience - The Elusive Dependent Variable

February 11, 2008

While it’s intuitive that end user experience is the most accurate measure of the quality and reliability of IT services, it’s often much less clear how to measure it. Moreover, there is no standard template for integrating an evaluation of user experience into the myriad other objects monitored and measured. Framed in scientific terms, though, the process can be quite simple. The goal is to build a picture of cause and effect in our environment that employs end user experience as the overall barometer of application performance and links it to all the potential service delivery problems.

In statistical terms, end user experience is our Dependent Variable: the outcome we want to explain. All the other things that can go wrong are our Independent Variables, such as a system going down, a disk running out of space, no response from the web server or database, network connectivity issues, and so on. Ideally, we want to build a visualization of Dependent and Independent Variables together, so that we immediately see the cause and effect relationship between end user experience and measures from many other application and network performance sources. These typically include hardware, operating system, and application performance measures, along with network monitoring and infrastructure tests like PING and port checks.

A really good SLA (Service Level Agreement) will include an accurate measure of the end user experience plus the criteria that can impact service delivery, so it presents a cause and effect picture. It not only tells us when we aren’t performing; it also suggests answers to the question “Why?” when service delivery suffers.

Capturing the elusive Dependent Variable is our first goal. Measuring end user experience really means doing something a user does and evaluating success or failure and the time required. Using the example of a common web application model, in Longitude we would simulate a transaction that logs in to the web site, navigates to some page, and performs some transaction. The result of our Internet solution test becomes our Dependent Variable. We can also measure components of end user experience separately by performing additional transaction monitoring tests that measure the web server response and the database response to a query. These are the first components added to our hypothetical SLA.

Our list of Independent Variables includes the potential impediments to service delivery, such as exhausted system resources, consumed network bandwidth, transaction rates, and more. The goal here is to capture enough of a picture to include 90% of the common issues that can arise. This approach to SLA monitoring will enable us to see at a glance what is going wrong when end user experience is sub-standard.
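To make the idea concrete, the sketch below times a simulated user transaction, runs a few component checks alongside it, and reports which components look suspect when the user-facing check fails. Everything here is a placeholder: the check names, the time limits, and the stand-in actions are assumptions for illustration, and a real test would log in to the site, query the database, and so on rather than call empty functions.

```python
import time


def run_check(name, action, max_seconds):
    """Run an action, recording whether it succeeded within the time limit."""
    start = time.perf_counter()
    try:
        action()
        succeeded = True
    except Exception:
        succeeded = False
    elapsed = time.perf_counter() - start
    return {"name": name, "ok": succeeded and elapsed <= max_seconds, "seconds": elapsed}


def evaluate_sla(user_check, component_checks):
    """Pair the end user experience result with the components that may explain it."""
    if user_check["ok"]:
        return {"sla_met": True, "suspects": []}
    suspects = [check["name"] for check in component_checks if not check["ok"]]
    return {"sla_met": False, "suspects": suspects}


# Stand-ins for real tests (hypothetical): a login transaction that fails
# because the database query behind it raises an error.
def database_query():
    raise ConnectionError("database did not respond")


def login_transaction():
    database_query()


user = run_check("login transaction", login_transaction, max_seconds=3.0)
components = [
    run_check("web server response", lambda: None, max_seconds=1.0),
    run_check("database query", database_query, max_seconds=1.0),
]

print(evaluate_sla(user, components))
# {'sla_met': False, 'suspects': ['database query']}
```

The report answers both questions at once: the user-facing transaction failed, and the database check is the component to look at first.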

Posted by Chris Smith, Senior Technical Engineer

>> Details on Longitude Packages & Upgrades

January 17, 2008

As you may have read in our recent press announcement, Heroix has released two new Longitude versions. Some have asked what this means for the “old” Heroix Longitude, so I thought I’d offer a little background on each package. The software we know as Heroix Longitude is still alive and well, and is now called Longitude Enterprise Edition, reflecting its full coverage of application performance and network monitoring, with advanced features that facilitate IT monitoring and management in large enterprises. The two new versions – Longitude Standard Edition and Longitude Professional Edition – consist of selected features packaged and priced to meet the needs of smaller and mid-sized businesses.

Longitude Standard Edition provides out-of-the-box operating system and IT infrastructure monitoring that’s affordable for small to medium businesses; it covers Windows (including Server, XP, and Vista), Red Hat and SuSE Linux, AIX, HP-UX, Sun Solaris, VMware ESX, Cisco devices, and transactions. It features a web-based user interface, an event monitor for real time monitoring, proactive notification and corrective action, Windows Event Log consolidation, tailorable rules and thresholds, a real time statistics dashboard, and interactive reporting with built-in performance and event reports.

Longitude Professional Edition provides everything found in Standard Edition, plus application performance monitoring and event handling often needed by mid-sized to larger businesses. It monitors IIS, Apache, Oracle, SQL Server, MySQL, Exchange, DHCP, Active Directory, Dell OpenManage, HP Systems Insight Manager, and IBM Director. Professional Edition also monitors additional transaction types and can import MIBs to monitor any SNMP-based network device. It includes application-specific event monitor views, and can schedule alerts and actions, escalate events, and export performance data. Active Directory can optionally be used for authentication on Windows.

Longitude Enterprise Edition includes all this plus full Service Level Agreement (SLA) monitoring, including alerting, a real time dashboard, and historical reporting. It also features advanced event correlation, a fully customizable Event Monitor (including the ability to define a display based on your own network topology), user experience monitoring (via synthetic web transactions), the ability to send SNMP traps, and J2EE application monitoring. You can schedule reports to run on a daily, weekly, or monthly schedule, and an archived reporting portal allows you to publish reports for viewing by particular types of users (e.g., SLA reports for business managers).

In a nutshell, those are the differences between the new packages. A few people have also asked what happens if they start out with Longitude Standard Edition but then wish to upgrade to either Professional or Enterprise. This is handled easily through the licensing, so if you decide to upgrade, you do not need to reinstall or reconfigure the software. All you need to do is enter a new license key.

Posted by: Dick Levin, VP of Development

© 2008 Heroix | Email: info@heroix.com