Determining which metrics to measure for a Service Level Agreement is perhaps the least difficult part of implementing SLA compliance standards. The bigger challenge is how to best go about aggregating, measuring, and visualizing SLA compliance.
In an earlier post, we focused on application performance and availability, and we delved into how Service Level Agreements (SLAs) help define and measure the level of service that IT is delivering to its customers. We explored:
- How an SLA template incorporates metrics that end users are particularly concerned about, namely application availability and response time.
- How IT must target key performance indicators (KPIs) related to IT infrastructure (physical, virtual, cloud) and application performance (database, web, etc.) in order to ensure compliance.
It is essential that all performance and availability metrics defined within an SLA template be measurable. Therefore, the value of any technology that collects and measures service level compliance is directly tied to how the SLA compliance data is displayed and how easy it is to act on the data. The technology team needs to be able to quickly visualize and diagnose the severity and persistence of any problems that compromise service level compliance.
In addition, SLA monitoring that provides advance warning and alerting to IT is particularly valuable, especially if IT is given the opportunity to prevent or, at the very least, mitigate the issues that affect SLA compliance.
IT has to be able to present SLA compliance in a clear and concise way. In Longitude, SLA compliance information is displayed in the form of dashboards and reports.
In their most basic form, Longitude Service Level Agreement (SLA) dashboards show the availability and performance expected and required for IT service(s).
Figure 1. Summary SLA Dashboard
The Summary SLA dashboard above aggregates the performance of critical IT infrastructure and application components of a multi-tiered application. The display pinpoints compliance issues that affect application performance and ultimately end user response time. The dashboard’s visual representation quickly shows - for the defined time period:
- Service Availability - What is the overall compliance of the service being delivered?
The pie chart shows overall SLA compliance. Here, we clearly have an issue or set of issues that requires attention.
- Service Availability by Hour - What is the compliance per hour?
Each vertical bar represents compliance for one hour. We want to see if there is a particular pattern. For example, is the compliance problem specific to only certain hours of the day?
- Service Condition Availability - Which KPIs are out of compliance, how badly, and for how long?
We see here that there is a correlation between web and database response time and the health of the SQL infrastructure, giving IT a starting point from which to further diagnose database issues.
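To make the "compliance per hour" view concrete, here is a minimal sketch of how hourly compliance could be computed from raw evaluations. It is not how Longitude computes its dashboard. It assumes each evaluation has already been reduced to a timestamp and a compliant/non-compliant flag, and the function name `compliance_by_hour` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def compliance_by_hour(samples):
    """Return {hour_start: percent_compliant}.

    samples: iterable of (timestamp, compliant) pairs, where compliant is a bool.
    This input shape is an assumption made for illustration.
    """
    totals = defaultdict(int)
    compliant_counts = defaultdict(int)
    for timestamp, compliant in samples:
        # Truncate the timestamp to the start of its hour to form the bucket key.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        totals[hour] += 1
        if compliant:
            compliant_counts[hour] += 1
    return {
        hour: 100.0 * compliant_counts[hour] / totals[hour]
        for hour in sorted(totals)
    }


samples = [
    (datetime(2024, 1, 8, 9, 5), True),
    (datetime(2024, 1, 8, 9, 35), False),
    (datetime(2024, 1, 8, 10, 10), True),
]
hourly = compliance_by_hour(samples)
```

Each key in `hourly` corresponds to one vertical bar, so a run of low values at the same hours is the kind of pattern the dashboard is meant to expose.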
Defining an SLA Template
An SLA dashboard is based on an SLA Template. The Longitude SLA Template consists of “service conditions”. Each service condition measures a key performance indicator (KPI) or set of KPIs for availability and/or performance.
Figure 2. Longitude SLA Template with service conditions
We can see that we've defined multiple service conditions for our multi-tiered application as part of an SLA Template. In this instance the SLA is evaluating:
- Web Response time and application availability for end user metrics
- Network and virtual infrastructure KPIs for IT
- Application performance KPIs related to IIS and SQL
Figure 3. Defining a simple service condition in Longitude
Here we are defining a service condition that is based on a single KPI related to web response time. Longitude is configured to:
- Visit a web page https://intranet
- Constantly execute an internet macro “CheckWebService” that synthetically navigates through https://intranet
- Verify both proper content is returned and the timing of the transaction. A response time greater than ¾ of a second constitutes degraded behavior and longer than 1.5 seconds constitutes unacceptable behavior.
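The graduated thresholds in this service condition can be sketched as a small classification function. This is an illustration only, not Longitude's implementation: the function name `classify_web_check` is hypothetical, and treating a failed content check or an unreachable page as "Down" is our assumption, since the article does not say how those cases are scored.

```python
DEGRADED_AFTER_S = 0.75      # from the text: more than 3/4 of a second is degraded
UNACCEPTABLE_AFTER_S = 1.5   # from the text: longer than 1.5 seconds is unacceptable


def classify_web_check(response_time_s, content_ok):
    """Map one synthetic web check to an outcome label.

    response_time_s: transaction time in seconds, or None if the page
                     could not be reached at all.
    content_ok:      True if the expected content was returned.
    """
    # Assumption: no response or wrong content counts as "Down".
    if response_time_s is None or not content_ok:
        return "Down"
    if response_time_s > UNACCEPTABLE_AFTER_S:
        return "Unacceptable"
    if response_time_s > DEGRADED_AFTER_S:
        return "Degraded"
    return "Good"
```

Because both comparisons are strict, a response of exactly 0.75 seconds is still "Good" and exactly 1.5 seconds is still "Degraded", which matches the "greater than" and "longer than" wording above.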
Diagnosing SLA Compliance Issues
When Longitude evaluates a service condition, a number of outcomes are possible:
|Outcome|Description|
|---|---|
|Good|Service is available and operating as expected.|
|Degraded|Service is available and is operating at a less than acceptable level of performance.|
|Unacceptable|Service is available, but is operating at an unacceptable level of performance.|
|Maintenance|Service is unavailable due to scheduled or requested maintenance.|
|Down|Service is unavailable when it should be available.|
Figure 4. Summary SLA dashboard
The Longitude SLA dashboard above clearly shows that there is an issue affecting SQL Health and that both Web Response time and Database Response time are impacted. Upon closer analysis, we can also see there is a close correlation between Web Response time and Database Response time.
We will want to drill down further and make a determination as to what is causing SQL Health issues. We see quite a bit of degraded and unacceptable behavior.
You will notice that the SLA dashboard is showing a value for "Acceptable", but it is not listed as an outcome in the table above. Acceptable is the SLA compliance value that is to be presented to the end user.
Acceptable is a calculated value of Good + Degraded and is shown in green.
The Degraded state is a warning to IT, letting them know that the SLA is approaching non-compliance and that some sort of intervention is required.
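As a rough sketch of the arithmetic, the snippet below turns a list of evaluation outcomes into percentages and derives Acceptable as Good + Degraded. The function name `compliance_summary` is hypothetical. The article does not say how Maintenance periods are weighted, so this sketch simply counts every evaluation in the denominator, which is an assumption.

```python
from collections import Counter

OUTCOMES = ("Good", "Degraded", "Unacceptable", "Maintenance", "Down")


def compliance_summary(outcomes):
    """Return the percentage of evaluations per outcome, plus 'Acceptable'."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumption: every evaluation, including Maintenance, is in the denominator.
    summary = {name: 100.0 * counts[name] / total for name in OUTCOMES}
    # From the text: Acceptable is the calculated value Good + Degraded.
    summary["Acceptable"] = summary["Good"] + summary["Degraded"]
    return summary


outcomes = ["Good"] * 6 + ["Degraded"] * 2 + ["Unacceptable", "Down"]
summary = compliance_summary(outcomes)
```

In this example the end user would see 80% Acceptable, while IT can still see that a quarter of that figure came from degraded evaluations.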
Figure 5. Detail from SLA dashboard
Upon further analysis (drilling down further in the Longitude dashboard), we have identified a CPU consumption problem: the processor queue length is exceeding our “good” value of “3”. Using the information provided here, along with Longitude’s built-in knowledge base, we can zero in on what is consuming excessive CPU time.
Alerting on SLA non-compliance
Service Level Agreement templates, particularly ones that are defined for use by an IT organization, should never be binary in nature; rather, they should be constructed with graduated compliance thresholds. Service conditions entering a degraded state are indicative of an impending problem and should provide enough warning so that IT can resolve the issue(s) before the SLA goes out of compliance. Remember, IT staff are not looking at a dashboard all the time, so having a mechanism in place to page or email them about compliance issues is invaluable.
This SLA has been out of compliance for some time now, and if we had configured automated notification, actions could have been taken to address the problem and make compliance values dramatically better. Again, the principle here is to be notified before the SLA is out of compliance.
Figure 6. Alerting on SLA compliance
It is also good practice to put built-in escalation into place (i.e., notification based on severity or persistence). A brief period of degradation might be a one-off problem that won't occur again and may not warrant the same level of attention as repeated instances of degraded behavior. Also, down or unacceptable behaviors often indicate more severe problems that require different skill sets to resolve than degraded behavior does.
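One way to picture escalation by severity and persistence is the sketch below. It is a hypothetical policy, not Longitude's alerting configuration: the function name `choose_notification`, the "page" and "email" channel labels, and the default of three consecutive degraded evaluations are all assumptions chosen for illustration.

```python
def choose_notification(recent_outcomes, degraded_persistence=3):
    """Decide how to notify IT from recent outcomes (oldest to newest).

    Returns "page", "email", or None. The policy is illustrative:
    - Down or Unacceptable: page immediately (severity).
    - Degraded: email only after it has persisted for
      `degraded_persistence` consecutive evaluations (persistence).
    """
    if not recent_outcomes:
        return None
    latest = recent_outcomes[-1]
    if latest in ("Down", "Unacceptable"):
        return "page"
    if latest == "Degraded":
        # Count how many of the most recent evaluations were Degraded in a row.
        streak = 0
        for outcome in reversed(recent_outcomes):
            if outcome != "Degraded":
                break
            streak += 1
        if streak >= degraded_persistence:
            return "email"
    return None
```

With this policy a single degraded evaluation stays quiet, a sustained degraded run produces an email before the SLA goes out of compliance, and anything more severe goes straight to a page.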
Reporting SLA compliance
Figure 7. Summary SLA compliance report
Historical context for SLA compliance is critical:
- Reporting provides an objective view of SLA compliance. Reports can show exactly what times of the day or days of the week compliance is an issue, and by drilling down into the reports more details can be revealed about the severity, persistence, and nature of any existing problems.
- Reporting is essential in identifying patterns of non-compliance. Is compliance a problem during specific hours of the day or days of the week? The above report displays the following over a 4-day period:
- Overall Service Availability and compliance
- Average compliance per hour across the 4 days
- Average compliance per day for each of the 4 days
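The two averages in the report can be sketched as follows, assuming the input is a mapping from the start of each hour to that hour's compliance percentage. The function name `report_averages` and the input shape are hypothetical, and this is not how Longitude builds its reports.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def report_averages(hourly_compliance):
    """Return (average per hour of day, average per calendar day).

    hourly_compliance: {hour_start (datetime): percent compliant for that hour}
    """
    by_hour_of_day = defaultdict(list)
    by_day = defaultdict(list)
    for hour_start, percent in hourly_compliance.items():
        by_hour_of_day[hour_start.hour].append(percent)   # same hour, all days
        by_day[hour_start.date()].append(percent)         # all hours, same day
    per_hour = {h: mean(v) for h, v in sorted(by_hour_of_day.items())}
    per_day = {d: mean(v) for d, v in sorted(by_day.items())}
    return per_hour, per_day


hourly_compliance = {
    datetime(2024, 1, 8, 9): 100.0,
    datetime(2024, 1, 8, 10): 50.0,
    datetime(2024, 1, 9, 9): 80.0,
    datetime(2024, 1, 9, 10): 70.0,
}
per_hour, per_day = report_averages(hourly_compliance)
```

Note that averaging hourly percentages weights every hour equally, regardless of how many evaluations fell in it. In this example both days average 75%, yet the per-hour view shows that the 10:00 hour is consistently the weaker one, which is the kind of pattern a daily figure alone would hide.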
Correlating user experience metrics with the underlying infrastructure and application metrics that support the associated business services is a critical component of SLA monitoring. SLA compliance technology that includes dashboards, alerts, and reports helps keep IT ahead of potential problems and enables better-informed business decisions based on a more complete view of business application performance.