Visit Heroix at http://www.heroix.com
Charting Life in the IT Environment

>> Resolve problems faster - get proactive with SLAs

July 22, 2008

SLAs (Service Level Agreements) often call to mind images of historical reporting and compliance: essentially, documenting performance (or problems with it) after the fact. While that’s part of the picture, this narrow view leads many IT organizations without formal reporting requirements to overlook the benefit of SLAs in proactive monitoring. So if you think you don’t need SLAs, think again.

Traditionally, SLAs have been used to measure the availability of specific services and to report the percentage of time a given service is up or down. Longitude builds on this concept by allowing you to define SLAs that track anything from a simple up/down status to the overall health of an entire multi-tiered application. For example, if a mission-critical application depends on the availability and performance of a web server, an application server, a back-end database, and network connectivity and bandwidth, Longitude enables you to define an SLA that represents the combined state of all the underlying operational components.


If any single component is down or operating outside acceptable tolerance, this is reflected in the status of the overall SLA. Longitude can then report and alert, in real time or historically, on exactly what was out of compliance, for how long, and how severely. This helps you be proactive in two ways. First, because the SLA incorporates all of the components that support the business service, Longitude eliminates finger-pointing and cuts resolution time by showing you exactly what is causing the problem. Second, by allowing you to specify degraded as well as unacceptable levels of performance for each component, Longitude can alert you before end users are affected, and even take corrective action if desired. You can learn more in our SLA Best Practices Guide.
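To make the roll-up idea concrete, here is a minimal Python sketch. It is not Longitude's implementation; it simply assumes that the overall SLA status is the worst status among its components. The component names, readings, and thresholds are all hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    OK = 0
    DEGRADED = 1
    UNACCEPTABLE = 2


def component_status(value, degraded_at, unacceptable_at):
    """Classify one measurement, assuming higher values are worse."""
    if value >= unacceptable_at:
        return Status.UNACCEPTABLE
    if value >= degraded_at:
        return Status.DEGRADED
    return Status.OK


def sla_status(components):
    """Return (overall status, sorted names of components that are not OK)."""
    statuses = {name: component_status(*args) for name, args in components.items()}
    overall = max(statuses.values(), default=Status.OK)
    offenders = sorted(n for n, s in statuses.items() if s is not Status.OK)
    return overall, offenders


# Hypothetical readings: (current value, degraded threshold, unacceptable threshold)
components = {
    "web_server_response_ms": (120, 500, 2000),
    "db_query_ms": (850, 500, 2000),
    "network_utilization_pct": (40, 70, 90),
}
overall, offenders = sla_status(components)
print(overall.name, offenders)  # DEGRADED ['db_query_ms']
```

Because the degraded threshold trips before the unacceptable one, the roll-up flags the database query while the service is still usable, which is the early-warning behavior described above.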


Posted by Heroix Support

>> Lose the False Alarms: 5 Tips for Better Performance Monitoring

June 9, 2008

“We started getting so many alerts we couldn’t tell what to pay attention to.”

Sound familiar?

Unfortunately, any IT monitoring effort can come with a snag: factory settings that are too high, too low, or just not applicable to your workload. Whether you are using commercial software or working with an open source or home-grown monitoring solution, over-notification can actually make you less productive and allow real, sometimes serious, problems to fall through the cracks.

If you are considering implementing a monitoring solution – or looking to improve what you are already doing – here are five common pitfalls and how you can avoid them.

1. Watch out for “one size fits all” thresholds.
Different workloads require different performance thresholds, and unless your monitoring software is tailored to your environment, you will end up with false alarms for applications where high utilization is the norm.

Save yourself from headaches by addressing this the first time you receive what might be considered an “over eager” alert. In Longitude, you can change settings right from the event monitor dashboard, as soon as you see the problem.


Screen shot showing threshold adjustment


Furthermore, Longitude helps you determine appropriate thresholds by calculating minimum, maximum, and average workload values for any threshold you may need to adjust. This saves you time and takes the guesswork out of configuration. You can even view workload values and change thresholds globally or for a subset of servers, all in one step. Configuring a few, hundreds, or even thousands of servers is quick and simple.
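If you want to try the same idea on your own data, the Python sketch below summarizes a set of workload samples and derives a candidate threshold from them. The "observed maximum plus 10% headroom" rule and the sample values are assumptions made for illustration; they are not how Longitude chooses thresholds.

```python
from statistics import mean


def workload_summary(samples):
    """Minimum, maximum, and average of the observed workload values."""
    return {"min": min(samples), "max": max(samples), "avg": mean(samples)}


def suggest_threshold(samples, headroom=1.10):
    """Hypothetical rule: alert only above the observed maximum plus headroom.

    The result is capped at 100 because the samples are percentages.
    """
    return round(min(100.0, max(samples) * headroom), 1)


# Hypothetical CPU utilization samples (%) from a server that is normally busy
cpu_samples = [72, 85, 78, 90, 81]
print(workload_summary(cpu_samples))
print(suggest_threshold(cpu_samples))  # 99.0
```

A default threshold of, say, 80% would alert constantly on this server, while a threshold derived from its normal range stays quiet until something actually changes.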

Screen shot showing minimum, maximum and average workload values

2. Filter out “non-problems.”
Just as there may be threshold values specific to your environment, there may also be individual components or even whole classes of problems that you do not want reported. For example, there may be specific Windows services, Unix/Linux file systems, or network interfaces that are not considered mission critical. Longitude allows you to specify filters based on component names as well as performance characteristics, so you can skip data collection for those you do not wish to monitor.
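A name-based filter of this kind can be sketched in a few lines of Python. The patterns and component names below are hypothetical, and the sketch covers only filtering by name, not by performance characteristics.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion patterns: a scratch file system, /tmp, loopback interfaces
EXCLUDE_PATTERNS = ["/mnt/scratch*", "/tmp", "lo*"]


def should_collect(component_name, exclude_patterns=EXCLUDE_PATTERNS):
    """Return False for components whose name matches any exclusion pattern."""
    return not any(fnmatchcase(component_name, p) for p in exclude_patterns)


for name in ["/var", "/mnt/scratch1", "eth0", "lo0"]:
    print(name, should_collect(name))
```

Filtering before collection, rather than after an alert fires, keeps non-critical components out of both the event list and the stored data.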

Screen shot showing data collection filter

3. Avoid repetitive notification for persistent problems.
Some problems take time to correct. When you or someone else on the staff will be working on an issue for a period of time, repeated reminders are not only unnecessary, but annoying and distracting.

Longitude allows you to suppress notification, again right from the event monitor, to allow for repair time. If for any reason you decide that an event is not applicable to your environment, you can disable it entirely; should the situation change, you can simply re-enable it.
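The suppression logic itself is simple. The Python sketch below shows one way to model it, assuming each event has an identifier and that suppression lasts for a fixed repair window; the class, method names, and event identifier are hypothetical.

```python
from datetime import datetime, timedelta


class Suppressor:
    """Tracks events whose notifications are paused while repairs are under way."""

    def __init__(self):
        self._until = {}  # event id -> datetime at which suppression ends

    def suppress(self, event_id, now, duration):
        self._until[event_id] = now + duration

    def should_notify(self, event_id, now):
        until = self._until.get(event_id)
        return until is None or now >= until


suppressor = Suppressor()
start = datetime(2008, 6, 9, 10, 0)
suppressor.suppress("disk_full:db01", start, timedelta(hours=2))
print(suppressor.should_notify("disk_full:db01", start + timedelta(hours=1)))  # False
print(suppressor.should_notify("disk_full:db01", start + timedelta(hours=2)))  # True
```

Once the window expires, notification resumes on its own, so a problem that was not actually fixed is not silenced forever.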

Screen shot showing event shutoff

4. Don’t be fooled by multi-symptom problems.
It’s not uncommon for a single problem to exhibit multiple symptoms. For example, if a router is down, it may “look” like all the systems it serves are down, resulting in multiple alerts that are in reality all attributable to the same root cause. Better visibility into underlying causes eliminates event clutter and speeds time-to-resolution.

Using correlated events, Longitude can determine the root cause of a problem and avoid duplicate notifications. In the case of the router outage, Longitude can recognize the situation by correlating the state of individual servers with the state of the router, and send just one notification (suppressing the individual server notifications) if the router malfunctions.
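The router example can be sketched in Python as follows. This is only an illustration of the correlation idea, assuming you already know which upstream device each server depends on; the device names and the `parent_of` mapping are hypothetical.

```python
def devices_to_notify(down, parent_of):
    """Return devices that are down and whose upstream device is not also down.

    A device whose parent is down is treated as a symptom, not a root cause.
    """
    down = set(down)
    return sorted(d for d in down if parent_of.get(d) not in down)


# Hypothetical topology: two servers reached through one router
parent_of = {"web01": "router1", "db01": "router1"}

print(devices_to_notify({"router1", "web01", "db01"}, parent_of))  # ['router1']
print(devices_to_notify({"web01"}, parent_of))                     # ['web01']
```

In the first case three devices look down but only the router produces a notification; in the second, the server is down on its own and is reported normally.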

Screen shot showing correlated event

5. Remember: Some problems are time-based.
Depending on when the symptom occurs, an issue may or may not require attention. For example, if your virus scan runs at 1 AM and causes a spike in CPU usage for two hours at that time, you would not want to be notified during that time period. Or, if you need to notify different personnel at different times of day, it makes sense to notify only those staff on duty at any given time. Longitude accomplishes this by allowing you to schedule notifications for different events. You can also have non-problems eliminated from the event database altogether during specified periods such as system maintenance windows.
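Both cases, quiet periods and on-duty routing, come down to checking the time of an event against a set of windows. The Python sketch below uses the 1 AM virus scan from the example above; the event names, team names, and shift times are hypothetical.

```python
from datetime import time


def in_window(t, start, end):
    """True if t falls in [start, end); handles windows that cross midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


# The virus scan from the text: CPU spikes between 1 AM and 3 AM are expected
QUIET_WINDOWS = {"cpu_high": [(time(1, 0), time(3, 0))]}

# Hypothetical on-duty schedule
ON_CALL = [
    (time(8, 0), time(20, 0), "day-team"),
    (time(20, 0), time(8, 0), "night-team"),
]


def recipients(event, t):
    """Return who should be notified about an event occurring at time t."""
    if any(in_window(t, s, e) for s, e in QUIET_WINDOWS.get(event, [])):
        return []
    return [who for s, e, who in ON_CALL if in_window(t, s, e)]


print(recipients("cpu_high", time(2, 0)))   # []
print(recipients("cpu_high", time(4, 0)))   # ['night-team']
print(recipients("cpu_high", time(12, 0)))  # ['day-team']
```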

Screen shot showing notification schedule

Solution or Shelfware?
Automated performance monitoring holds great potential for any IT organization striving to maintain high levels of service for its critical business applications, but experience shows that “factory” settings, even those based on industry best practices, can lead to over-alerting that is annoying, distracting, and counter-productive. Many overwhelmed IT organizations ignore or even decommission monitoring software because it is just too difficult to tune to their unique environment.

As the above examples show, properly tailored monitoring software can filter out false alarms and alert staff to true problems before they affect business processes. This saves the organization time and money and allows IT to focus on strategic organizational objectives rather than on constantly finding and fixing problems after they’ve occurred.

>> Monitoring non-Cisco devices

June 3, 2008

Q: Longitude has a built-in solution for monitoring Cisco network devices. Can Longitude monitor non-Cisco network devices?

A: Yes. Longitude’s built-in Cisco solution uses the standard RFC 1213 MIB (MIB-II) for data collection, so in many cases you can use it to monitor non-Cisco devices out of the box. The Cisco solution proactively monitors key performance metrics including bandwidth utilization, IP packet errors, TCP errors, TCP retransmits, UDP errors, and queue lengths. Longitude alerts you when there is a problem, and also provides pre-configured, on-demand reports and graphs to help you understand performance trends and ensure maximum availability. For more information about built-in monitoring of Cisco and other network devices, please consult the Data Sheet for the Cisco Solution (http://www.heroix.com/downloads/pdf/Longitude_Network.pdf).
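For readers curious how a metric like bandwidth utilization is derived from RFC 1213 data, here is a minimal Python sketch. It assumes you have already polled an interface's ifInOctets counter twice and know its ifSpeed (in bits per second); the polling itself and the sample numbers are not shown or are hypothetical, and this is not a description of Longitude's internal calculation.

```python
COUNTER32_MODULUS = 2 ** 32  # RFC 1213 octet counters are 32-bit and wrap around


def counter_delta(previous, current):
    """Difference between two Counter32 readings, allowing for a single wrap."""
    return (current - previous) % COUNTER32_MODULUS


def utilization_pct(prev_octets, curr_octets, interval_s, if_speed_bps):
    """Percent utilization in one direction over the polling interval."""
    bits = counter_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits / (interval_s * if_speed_bps)


# Hypothetical readings: two ifInOctets polls 300 seconds apart on a 10 Mbps link
print(utilization_pct(1_000_000, 38_500_000, 300, 10_000_000))  # 10.0
```

Because these objects are part of the standard MIB rather than a Cisco-specific one, the same arithmetic applies to any device that implements RFC 1213.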

If you wish to monitor items not collected by the built-in solution, then Longitude’s SNMP Studio enables you to monitor any SNMP-based device or application, including switches, routers and other hardware devices, as well as middleware and custom applications. The SNMP Studio also provides an interface for browsing Management Information Base (MIB) files. SNMP Studio comes pre-loaded with a variety of MIBs, and additional MIBs can be added easily. Creating a solution in SNMP Studio is as simple as browsing a MIB tree to select SNMP objects for collection and then filling in brief forms in order to configure integrated events and reports for Longitude to create. For more information, please consult the Data Sheet for the SNMP Studio (http://www.heroix.com/downloads/pdf/Longitude_SNMP_Studio.pdf).

Posted by Alison Murphy, Senior Technical Support Engineer

>> Managing Longitude Database Size

March 25, 2008

Q: How do I manage the size and disk usage of the Longitude database?

A: Longitude uses an open source SAP database, which is automatically created with 3 GB allocated on the drive you choose when you install Longitude. The database auto-expands on that drive when it becomes 80% full or has less than 100 MB free. You can also extend it manually on the same drive, but that is rarely needed given Longitude’s self-maintaining features.
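The expansion rule can be written as a small Python check, which may help if you want to estimate when the database will grow. The function name and sample figures are hypothetical, and the sketch only restates the rule described above; it is not Longitude's code.

```python
def needs_expansion(allocated_mb, used_mb, pct_full_limit=80.0, min_free_mb=100.0):
    """True when the database is at least 80% full or has under 100 MB free."""
    free_mb = allocated_mb - used_mb
    pct_full = 100.0 * used_mb / allocated_mb
    return pct_full >= pct_full_limit or free_mb < min_free_mb


allocated = 3 * 1024  # the initial 3 GB allocation, in MB
print(needs_expansion(allocated, 2000))  # False
print(needs_expansion(allocated, 2500))  # True
```

With the default 3 GB allocation, the 80% condition is always reached first; the 100 MB condition only matters for allocations smaller than 500 MB.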

Do not gauge database consumption based on what Windows shows as the size of the \Longitude\sapdb\indep_data\wrk\FZEDB1\DATA0001 file. Even if Windows shows that file at 3 GB, the database is not necessarily nearing the full 3 GB allocated.

You can check the actual consumption by logging in to WebDbm:

http://localhost:7230/webdbm

u: dbm
p: {the password you specified for the original Longitude user during installation}

See the screen shot below.

Posted by Greg Savas, Technical Support Engineer

Screen shot showing database size

>> End User Experience - The Elusive Dependent Variable

February 11, 2008

While it’s intuitive that end user experience is the most accurate measure of the quality and reliability of IT services, it’s often much less clear how to measure it. Moreover, there is no standard template for integrating an evaluation of user experience into the myriad other objects monitored and measured. The process can be quite simple, though, when framed in scientific terms. The goal is to build a picture of cause and effect in our environment that employs end user experience as the overall barometer of application performance and links it to all the potential service delivery problems.

In statistical terms, the end user experience is our Dependent Variable: the outcome we want to explain. All the other things that can go wrong are our Independent Variables, such as a system going down, a disk running out of space, no response from a web server or database, or network connectivity issues. Ideally, we want to visualize the Dependent and Independent Variables together, so that we immediately see the cause and effect relationship between end user experience and measures from many other application and network performance sources. These typically include hardware, operating system, and application performance measures, along with network monitoring and infrastructure tests such as PING and port checks.

A really good SLA (Service Level Agreement) includes an accurate measure of the end user experience plus the criteria that can impact service delivery, which makes it a cause and effect picture. It not only tells us when we aren’t performing; it also suggests answers to the question “Why?” when service delivery suffers.

Capturing the elusive Dependent Variable is our first goal. Measuring end user experience really means doing something a user does and evaluating success or failure and the time required. Using the example of a common web application model, in Longitude we would simulate a transaction that logs in to the web site, navigates to some page, and performs some transaction. The result of our Internet solution test becomes our Dependent Variable. We can also measure components of end user experience separately by performing additional transaction monitoring tests that measure the web server response and the database response to a query. These are the first components added to our hypothetical SLA. Our list of Independent Variables includes the potential impediments to service delivery, such as exhausted system resources, consumed network bandwidth, transaction rates, and more. The goal here is to capture enough of a picture to include 90% of the common issues that can arise. This approach to SLA monitoring will enable us to see at a glance what is going wrong when end user experience is sub-standard.
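The core of a simulated transaction is easy to sketch: run the steps a user would take, record whether they succeed, and time the whole thing. The Python example below is a generic illustration, not Longitude's transaction monitor; the step names are hypothetical, and the placeholder steps stand in for real requests to your application.

```python
import time


def run_transaction(steps, max_seconds):
    """Run each (name, callable) step in order.

    Returns (ok, elapsed_seconds, failed_step_name). The transaction is ok only
    if every step completes without raising and the total time is within
    max_seconds.
    """
    start = time.perf_counter()
    for name, step in steps:
        try:
            step()
        except Exception:
            return False, time.perf_counter() - start, name
    elapsed = time.perf_counter() - start
    return elapsed <= max_seconds, elapsed, None


# Placeholder steps; in practice each would issue a real request
steps = [
    ("login", lambda: None),
    ("open_order_page", lambda: None),
    ("submit_order", lambda: None),
]
ok, elapsed, failed_step = run_transaction(steps, max_seconds=5.0)
print(ok, failed_step)
```

Recording which step failed, not just that the transaction failed, is what lets the end user measurement point toward a cause.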

Posted by Chris Smith, Senior Technical Engineer

© 2008 Heroix | Heroix | RSS | Privacy Policy | Email: info@heroix.com