Visit Heroix at http://www.heroix.com
Charting Life in the IT Environment

>> Single Pane of Glass Monitoring - What does it mean today?

August 19, 2008

The notion of a single pane of glass – being able to view your entire network and infrastructure from one console – is not new to IT, but over the years it has come to mean different things to different people. On one level it can be as simple as having a more intelligent view of network connectivity: if you lose connectivity to a couple dozen servers at the same time because a router failed, you receive a single alert, not a couple dozen. That kind of correlation has been a giant leap toward root cause analysis, and monitoring software such as Longitude applies it in a number of ways to help detect and diagnose multi-symptom problems (see tip 4 in our June 9 blog entry on preventing monitoring false alarms).
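
To make the router example concrete, here is a minimal sketch of that kind of correlation in Python. It is not how Longitude implements it; the topology table `UPSTREAM`, the device names, and the `correlate` function are all hypothetical, and the only rule applied is "if a device's upstream router is also unreachable, report the router once instead of each device."

```python
from collections import defaultdict

# Hypothetical topology (an assumption for illustration):
# each server maps to the router it is reached through.
UPSTREAM = {
    "web01": "router-a",
    "web02": "router-a",
    "db01": "router-a",
    "mail01": "router-b",
}

def correlate(unreachable):
    """Collapse per-device connectivity alerts into one alert per root cause."""
    down = set(unreachable)
    affected = defaultdict(list)  # root-cause device -> devices behind it
    roots = []
    for device in sorted(down):
        router = UPSTREAM.get(device)
        if router in down:
            # The upstream router is down too, so it is the likely root cause.
            affected[router].append(device)
        else:
            roots.append(device)
    alerts = []
    for device in roots:
        if affected[device]:
            behind = ", ".join(affected[device])
            alerts.append(f"{device} is unreachable (also affects: {behind})")
        else:
            alerts.append(f"{device} is unreachable")
    return alerts

# Four unreachable devices produce a single alert that names the router.
print(correlate(["web01", "web02", "db01", "router-a"]))
```

A server that is unreachable while its router is still up is reported on its own, so genuine single-server failures are not hidden.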

Today, with IT organizations now focused on delivering business services, the pane of glass is being viewed from a higher level. Whether the underlying cause is the network, a server, a router, a database, or a web site, IT staff need to know what business activity is compromised so they can respond appropriately. Furthermore, IT needs to be able to report to management on performance from a business perspective.

This requires the ability to collect and correlate an ever widening array of performance and availability metrics, and many organizations find themselves struggling with a piecemeal approach that relies on a patchwork of open source software, shareware, point products, and in-house scripts. Longitude allows you to collect and correlate data from a wide range of sources:

  1. Windows, Unix, and Linux operating systems (including VMware)
  2. Databases, including Microsoft SQL Server, Oracle, and MySQL
  3. Web Servers, including Microsoft IIS and Apache Web Server
  4. Microsoft Exchange Server
  5. J2EE™ Application Servers, including BEA WebLogic®, IBM WebSphere®, and JBoss®
  6. Cisco & any Network Device that uses a MIB
  7. SNMP Traps
  8. DHCP
  9. Infrastructure components, including Active Directory, Citrix, Dell OpenManage™, HP Systems Insight Manager (HP SIM), IBM Director
  10. Protocol Availability
  11. Syslog & Windows Event Logs
  12. End User Experience (Synthetic Web Transactions)

[Screenshot: Longitude Event Monitor Showing Business Units]
[Screenshot: Longitude Real Time Statistic Dashboard for ERP Application]

Longitude can then combine data from any of these sources in tailored Event Displays or Real Time Statistic Dashboards (aka “single pane of glass”) according to the business services you support.

Better Pane of Glass

Furthermore, Longitude actually elevates the pane of glass using Service Level Agreements. A Longitude SLA allows you to group together all the disparate components that support the multi-tiered applications underlying critical business processes, and it monitors for degradations in performance or availability of the service as a whole.

For example, if a mission critical application depends on the availability and performance of a web server, application server, back end database, network connectivity and bandwidth, Longitude enables you to define a service level agreement that represents the convergence of all the underlying operational components.

If any single component is down or operating out of acceptable tolerance, it is reflected in the status of the overall SLA. Longitude can then report and alert – in real time or historically – exactly what was out of compliance, for how long, and how severely. This helps you provide better service in several ways:

[Screenshot: Longitude SLA for Multi-Tiered Application]

  1. First, because it incorporates all of the components that support the business service, Longitude eliminates finger-pointing and cuts resolution time by showing you exactly what is causing the problem.
  2. Second, by allowing you to specify degraded as well as unacceptable levels of performance for each component, Longitude can alert you before end users are affected, and even take corrective action if desired.
  3. Third, by allowing IT staff to drill down into underlying issues, Longitude puts actionable information into the right hands.
  4. Finally, by allowing you to annotate SLAs with information about outages and remedies taken (see blue pin in screen shot), SLAs also provide the foundation for more meaningful management reporting.
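
The roll-up behavior described above, where any single component that is down or out of tolerance is reflected in the overall SLA, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Longitude's implementation: the `Status` enum, the `sla_status` function, the component names, and the "worst component wins" rule are all hypothetical.

```python
from enum import IntEnum

class Status(IntEnum):
    # Ordered so that a larger value means a worse state.
    ACCEPTABLE = 0
    DEGRADED = 1
    UNACCEPTABLE = 2

def sla_status(components):
    """Return (overall status, names of the components responsible for it).

    Assumption: the SLA as a whole takes the worst status of any component.
    """
    worst = max(components.values())
    if worst == Status.ACCEPTABLE:
        return worst, []
    culprits = sorted(name for name, s in components.items() if s == worst)
    return worst, culprits

# Hypothetical multi-tiered application.
overall, culprits = sla_status({
    "web server": Status.ACCEPTABLE,
    "application server": Status.DEGRADED,
    "database": Status.UNACCEPTABLE,
    "network": Status.ACCEPTABLE,
})
print(overall.name, culprits)
```

Returning the responsible components alongside the overall status is what lets a display point at the cause rather than only reporting that the service is unhealthy.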



Posted by Heroix Support

>> Resolve problems faster - get proactive with SLAs

July 22, 2008

SLAs (Service Level Agreements) often call to mind images of historical reporting and compliance – essentially documenting performance (or problems with it) after the fact. While that’s part of the picture, this view causes many IT organizations without formal reporting requirements to overlook the benefit of SLAs in proactive monitoring. So if you think you don’t need SLAs, think again.

Traditionally, SLAs have been used to measure the availability of specific services and to report on the percentage of time a given service is up or down. Longitude builds on this concept by allowing you to define Service Level Agreements to track anything from a simple up/down status to the overall health of an entire multi-tiered application. For example, if a mission critical application depends on the availability and performance of a web server, application server, back end database, network connectivity and bandwidth, Longitude enables you to define a service level agreement that represents the convergence of all the underlying operational components.


If any single component is down or operating out of acceptable tolerance, it is reflected in the status of the overall SLA. Longitude can then report and alert – in real time or historically – exactly what was out of compliance, for how long, and how severely. This helps you be proactive in two ways. First, because it incorporates all of the components that support the business service, Longitude eliminates finger-pointing and cuts resolution time by showing you exactly what is causing the problem. Second, by allowing you to specify degraded as well as unacceptable levels of performance for each component, Longitude can alert you before end users are affected, and even take corrective action if desired. You can learn more in our SLA Best Practices Guide.


Posted by Heroix Support

>> End User Experience - The Elusive Dependent Variable

February 11, 2008

While it’s intuitive that end user experience is the most accurate measure of the quality and reliability of IT services, it’s often much less clear how to measure it. Moreover, there is no standard template for integrating an evaluation of user experience into the myriad other objects monitored and measured. Framed in scientific terms, though, the process can be quite simple. The goal is to build a picture of cause and effect in our environment that employs end user experience as the overall barometer of application performance and links it to all the potential service delivery problems.

In statistical terms the end user experience is our Dependent Variable: the outcome we ultimately care about. All the other things that can go wrong are our Independent Variables, such as a system going down, a disk running out of space, no response from a web server or database, network connectivity issues, etc. Ideally, we want to build a visualization of the Dependent and Independent Variables together, so that we immediately see the cause and effect relationship between end user experience and measures from many other application and network performance sources. These typically include hardware, operating system, and application performance measures, along with network monitoring and infrastructure tests like PING and port checks.

A really good SLA (Service Level Agreement) will include an accurate measure of the end user experience plus the criteria that can impact service delivery, which makes it a cause and effect picture. It not only tells us when we aren’t performing; it also suggests answers to the question “Why?” when service delivery suffers.

Capturing the elusive Dependent Variable is our first goal. Measuring end user experience really means doing something a user does and evaluating success or failure and the time required. Using the example of a common web application model, in Longitude we would simulate a transaction that logs into the web site, navigates to some page and performs some transaction. The result of our Internet solution test becomes our Dependent Variable. We can also measure components of end user experience separately by performing additional transaction monitoring tests that measure the web server response and the database response to a query. These are the first components added to our hypothetical SLA. Our list of Independent Variables includes the potential impediments to service delivery, such as exhausted system resources, consumed network bandwidth, transaction rates, and more. The goal here is to capture enough of a picture to cover 90% of the common issues that can arise. This approach to SLA monitoring will enable us to see at a glance what is going wrong when end user experience is sub-standard.
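
As a rough illustration of "doing something a user does and evaluating success or failure and the time required," the Python sketch below times a sequence of steps and reports which one failed. It is not Longitude's transaction engine: `run_transaction` and the three placeholder steps are hypothetical, and in a real test each step would drive the web application (log in, navigate, submit) instead of returning immediately.

```python
import time

def run_transaction(steps):
    """Run each (name, callable) step in order.

    Returns (succeeded, elapsed_seconds, failed_step_name).
    A step signals failure by raising an exception.
    """
    start = time.perf_counter()
    for name, step in steps:
        try:
            step()
        except Exception:
            return False, time.perf_counter() - start, name
    return True, time.perf_counter() - start, None

# Placeholder steps (assumptions for illustration only).
def log_in():
    pass

def open_order_page():
    pass

def submit_order():
    pass

ok, elapsed, failed_step = run_transaction([
    ("log in", log_in),
    ("open order page", open_order_page),
    ("submit order", submit_order),
])
print(f"succeeded={ok} elapsed={elapsed:.3f}s failed_step={failed_step}")
```

Recording the elapsed time even when a step fails keeps slow failures (for example, timeouts) distinguishable from immediate ones.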

Posted by Chris Smith, Senior Technical Engineer

>> Determining a good threshold for a transaction component in an SLA

July 31, 2007

In my previous blog entry on July 17th I continued talking about defining SLAs. Here I complete that conversation.

In Longitude, you can use either the Statistics Dashboard or the SLA itself to observe transaction response times and get a reasonable first cut estimate for values to use as degraded/unacceptable thresholds. In the Statistics Dashboard, simply create a widget for the transaction you’re incorporating into your SLA and select “Response Time” as the Statistic to monitor. If you use the SLA itself, the transaction may not be visible in the SLA dashboard if there haven’t been any transaction failures. In that case, select the “Healthy Computers” check box to the left of the SLA detail diagram. The pie chart will be completely green, but each transaction will be listed with an average Response Time value, and a detail graph is available that plots the Response Time values on a timeline.
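
Once you have a set of observed response times, one way to turn them into first cut thresholds is to take high percentiles of the sample. The Python sketch below does that; the choice of the 95th and 99th percentiles, the function names, and the sample data are assumptions for illustration and are not Longitude defaults or recommendations.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (pct in 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def first_cut_thresholds(response_times_ms, degraded_pct=95, unacceptable_pct=99):
    """Suggest degraded/unacceptable thresholds from observed response times.

    Assumption: responses slower than most of what was observed during normal
    operation are worth flagging. Treat the result as a starting estimate.
    """
    return {
        "degraded": percentile(response_times_ms, degraded_pct),
        "unacceptable": percentile(response_times_ms, unacceptable_pct),
    }

# Hypothetical observations: 100 response times of 1..100 ms.
observed = list(range(1, 101))
print(first_cut_thresholds(observed))  # {'degraded': 95, 'unacceptable': 99}
```

Thresholds derived this way reflect only the period that was observed, so they are worth revisiting after the SLA has run through a busy period.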

A couple of additional notes on Transactions in SLAs: first, you need to register a Transaction in Manage Monitoring before it’s available to either the Statistics Dashboard or SLAs. Second, Ping transaction response times can be very helpful in finding slowdowns in distributed applications. It is possible for all the discrete SLA components to be perfectly well behaved, and not display any problems - but the network connecting them may be preventing the components from communicating effectively. The response time from the Ping Transaction can help to track down phantom performance problems that don’t show up on any of the individual SLA components.

I hope these tips on defining SLAs are helpful in your monitoring strategies. For more information about SLAs you can download our white paper at http://www.heroix.com/aspscript/wp_sla_form.asp.

Posted by Susan Bilder, Senior Technical Consultant

>> SLAs continued: Define the business service

July 17, 2007

In my previous blog entry on June 18th I began talking about defining SLAs. Here I continue that conversation.

When you define an SLA, there are two steps: the first is to break your distributed application into discrete components, and the second is to define acceptable performance levels for each component. While the first step is usually straightforward (SQL on one server, IIS on another, SMTP on a third server, etc.), the second step can be more difficult.

The Longitude SLA has three levels of performance – acceptable, degraded, and unacceptable. Degraded is actually a subclass of acceptable; while an application is degraded, it is still technically performing acceptably, just not optimally. These three levels of service are incorporated into SLAs when you define the compliance information – there is a “Required percent of time in acceptable state” (which counts acceptable plus degraded time), and a “Required percent of time in good state” (which counts acceptable time only, excluding degraded). So, you need to determine when a particular component is working, but not optimally, versus when a component is just not working acceptably at all.
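
The arithmetic behind those two compliance figures can be sketched in Python as follows. The function name, the dictionary keys, and the sample numbers are hypothetical; only the definitions (acceptable state = acceptable plus degraded, good state = acceptable only) come from the description above.

```python
def compliance(seconds_in_state, required_acceptable_pct, required_good_pct):
    """Compare time spent in each state against the two required percentages."""
    total = sum(seconds_in_state.values())
    if total <= 0:
        raise ValueError("no monitoring time recorded")
    good = seconds_in_state.get("acceptable", 0)
    acceptable = good + seconds_in_state.get("degraded", 0)
    acceptable_pct = 100 * acceptable / total
    good_pct = 100 * good / total
    return {
        "acceptable_pct": acceptable_pct,
        "good_pct": good_pct,
        "meets_acceptable": acceptable_pct >= required_acceptable_pct,
        "meets_good": good_pct >= required_good_pct,
    }

# Hypothetical period: 900 s acceptable, 80 s degraded, 20 s unacceptable.
result = compliance(
    {"acceptable": 900, "degraded": 80, "unacceptable": 20},
    required_acceptable_pct=99,
    required_good_pct=90,
)
print(result)
```

In this example the service spent 98% of the period in an acceptable state and 90% in a good state, so it meets the good-state requirement but misses the acceptable-state one.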

Determining the thresholds to use for degraded and unacceptable performance levels can be simple. In the definition of the service conditions, some metrics have suggested, best practice degraded/unacceptable thresholds (e.g. CPU Busy Time or Free Memory in the Windows or Unix applications). However, many metrics do not have suggested SLA thresholds, and user defined thresholds are needed for the SLA to make sense.

Not all metrics require thresholds, though. Transactions have discrete Fail/Succeed states that can be used in SLAs. For example, if a ping succeeds, the SLA is acceptable. If it fails, then the SLA is unacceptable. But transactions also collect transaction response times, and this can be very valuable information for an SLA. For example, the round trip time recorded in a ping transaction is a good measure of network latency, and the time it takes for a SQL Query transaction to complete is a good functional measure of database responsiveness. In these cases, SLAs become much more detailed if you assign degraded/unacceptable thresholds to transaction response times rather than just using default fail/succeed values.
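
The difference between the default fail/succeed evaluation and one that also uses response time thresholds can be shown with a short Python sketch. The `classify` function and the threshold values are hypothetical and chosen only for illustration.

```python
def classify(succeeded, response_ms, degraded_ms=None, unacceptable_ms=None):
    """Classify one transaction result.

    Without thresholds, only the fail/succeed state is used. With thresholds,
    a slow but successful transaction can be degraded or unacceptable.
    """
    if not succeeded:
        return "unacceptable"
    if unacceptable_ms is not None and response_ms >= unacceptable_ms:
        return "unacceptable"
    if degraded_ms is not None and response_ms >= degraded_ms:
        return "degraded"
    return "acceptable"

# A ping that succeeds but takes 250 ms:
print(classify(True, 250))                                        # acceptable
print(classify(True, 250, degraded_ms=100, unacceptable_ms=500))  # degraded
```

With thresholds in place, the same successful ping stops counting as fully healthy once its round trip time suggests a network slowdown.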

The question then becomes: how do you determine what a good threshold is for a transaction component in an SLA? Check back soon to learn more about defining SLAs.

Posted by Susan Bilder, Senior Technical Consultant

© 2008 Heroix | Heroix | RSS | Privacy Policy | Email: info@heroix.com