The Keys to Effective SLAs
Service Level Agreements are usually the object of desire, fear, and uncertainty all at the same time. They can be such useful tools that it’s important to demystify them. SLAs are desirable because they provide accountability and timely feedback to managers. They are to be feared when they include factors beyond your control or that are poorly aligned with reality. SLAs are commonly approached with a high degree of uncertainty about what to measure and how to report results in a way that is useful to all parties. While the ingredients in SLAs are as varied as applications and service providers, all effective SLAs share a few critical characteristics.
Good and Bad SLAs
Let’s start by poking fun at what will be the worst example of an SLA you’ve ever heard of or that I’ve been a party to implementing. I should point out this happened long before I became part of the Heroix team. I was brought in to design and implement a monitoring and reporting regime that supported the SLA between a web hosting provider and a Wall Street firm seeking its first web presence. Considering the big-time clients and huge capital expenditure (100 servers in two data centers), I was expecting a challenging assignment with a highly sophisticated and complex set of monitoring requirements. When I received my copy of the SLA, a single paragraph appendix to a large contract, it had one condition:
- No server shall experience greater than 30% average CPU usage during any rolling hour
You could have knocked me over with a feather. After disbelief, hilarity, and confusion, came concern and agitation. I actually suggested that we provide much more, which would have been included in any basic monitoring regime, and was rebuffed. What’s obviously wrong with this SLA is that the measure of success has no direct connection with actual service delivery to consumers. It was, however, very easy to measure. So the primary rule in creating effective SLAs is:
- Measure things that directly impact service delivery or user experience
Our first rule provides the guiding principle in answering the questions, “What should I measure and why should I measure it?” Of course, the WHY part of the answer should always be “Because it directly impacts service.” Some good examples of WHAT to measure would be:
- Availability of systems and applications
- Success of sessions and transactions
  - Web Pages, DB Queries, etc.
- Response Times where applicable
- Loss of resources critical to service delivery
  - Disk or DB space, Connection or Session limits
When selecting SLA measures it’s important to choose things that you have control over and that can be measured objectively, even if the statistic is as simple as 0 for True and 1 for False, as in the case of whether a required TCP port is accepting connections. Either it is (0) or it isn’t (1). A valuable planning exercise is to picture the data or transaction path, and reserve slots in your SLA for appropriate tests of each potential break point in a service. Using a typical web application example, a consumer connects to a web server, which creates a session on a back end application server, which in turn queries a DB server, ultimately sending a response back to the consumer. In our model the break points are the web, application, and DB servers, plus the network connecting them. By constructing a map of break points to monitor, you place yourself in a position to go beyond simply reporting a service failure by localizing where the service is breaking.
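To make the break point map concrete, here is a minimal sketch in Python of the TCP port test described above, applied to each hop in the web, application, and DB path. The host names, ports, and function names are hypothetical placeholders chosen for illustration, not part of any particular monitoring product, and a real regime would also test the network links and the transactions themselves.

```python
import socket

# Hypothetical transaction path: (break point name, host, port).
# Replace these placeholder hosts and ports with your own.
BREAK_POINTS = [
    ("web", "web01.example.com", 443),
    ("app", "app01.example.com", 8080),
    ("db", "db01.example.com", 5432),
]


def check_tcp_port(host, port, timeout=2.0):
    """Return 0 if the port accepts a TCP connection, 1 if it does not."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 0
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return 1


def probe_path(break_points, check=check_tcp_port):
    """Test every break point and return {name: 0 or 1}."""
    return {name: check(host, port) for name, host, port in break_points}


def failing_break_points(statuses):
    """List the break points whose test returned 1 (not accepting)."""
    return [name for name, status in statuses.items() if status != 0]
```

Calling `failing_break_points(probe_path(BREAK_POINTS))` on each polling cycle yields the list of hops that are currently broken, which is the localization step, rather than a single "service down" flag.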
Although I’m sure you get why it’s important to localize the point of failure for a service, it is worth examining the answer. Recall that one of our principal goals is to achieve accountability. That doesn’t just apply to apportioning blame afterwards. It means knowing who owns the component that has failed and who should immediately be given the lead to find a solution. A process of discovery always happens as soon as a service failure is detected. In my experience, quickly identifying who in a group of varied specialists responsible for different technologies should own a problem can be unnecessarily time consuming, if you know what I mean… This is especially true if an SLA is poorly designed and the data are ambiguous as to the cause of the failure. In a well designed SLA with data from each break point and each team member seeing the same picture, it’s usually immediately clear who “owns” the problem. You actually facilitate taking ownership of the problem and effecting a solution.
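One way to picture the ownership idea is a simple lookup from break point to responsible team. The sketch below assumes each break point reports 0 (healthy) or 1 (failed); the team names and the `OWNERS` table are invented for illustration only.

```python
# Hypothetical ownership table: which team takes the lead for each break point.
OWNERS = {
    "web": "web team",
    "app": "application team",
    "db": "database team",
    "network": "network team",
}


def owners_of_failures(statuses, owners=OWNERS):
    """Given {break point: 0 or 1}, return (break point, owner) for each failure.

    Break points missing from the table are reported as "unassigned" so that
    gaps in ownership are visible instead of silently ignored.
    """
    return [
        (name, owners.get(name, "unassigned"))
        for name, status in statuses.items()
        if status != 0
    ]
```

For example, `owners_of_failures({"web": 0, "app": 0, "db": 1})` points straight at the database team, which is the discovery step the paragraph above is trying to shorten.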
How to Report SLA Data
Designing and implementing the best SLA will be for naught if you fail to build accessible views of the data that can easily be assimilated into a concept of operations. In other words, you have to build that intuitive picture of your transaction path that everyone’s going to share, and put it somewhere everyone can see it quickly and easily. You may have noted that we are discussing using SLA data in the present tense, as in live presentations. Don’t be confused if you expected an SLA to be some tabular historical report to be compared to contract terms and conditions. An effective SLA is all of these things. What’s the point of identifying what can impact service delivery if we don’t use it as an intensive monitor of the health of our critical application? So let’s use the same data to create live presentations of the application’s current state, while also generating historical reports of compliance with key standards.
There are some types of data that do not lend themselves to live reporting, such as log data that’s collected nightly. Any type of daily or weekly aggregated data should be relegated to historical reporting. That can include both daily detail reports and long term summaries. Any data that is measured at least hourly can be represented effectively in live presentations or dashboards. Remember we want live data to be fresh (last 5-60 minutes) and not have to wait long periods for one component to refresh the picture again. For historical reporting, we’re really using the same data, just querying for longer periods, like weeks, months, and years (ok, daily if you’re feeling nervous or obsessive…).
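The "same data, different window" idea can be sketched in a few lines of Python. This assumes samples are stored as (timestamp, status) pairs with 0 meaning the test passed and 1 meaning it failed; the function names and the 60 minute freshness limit are illustrative choices, not a prescribed standard.

```python
from datetime import datetime, timedelta


def availability(samples, start, end):
    """Percent of samples with status 0 in [start, end), or None if no data."""
    window = [status for ts, status in samples if start <= ts < end]
    if not window:
        return None
    return 100.0 * window.count(0) / len(window)


def is_fresh(samples, now, max_age=timedelta(minutes=60)):
    """True if the newest sample is recent enough for a live dashboard."""
    if not samples:
        return False
    newest = max(ts for ts, _ in samples)
    return now - newest <= max_age


def live_and_historical(samples, now, live=timedelta(minutes=60),
                        history=timedelta(days=30)):
    """Query the same samples twice: a short live window and a long one."""
    return {
        "live": availability(samples, now - live, now),
        "historical": availability(samples, now - history, now),
    }
```

A dashboard would call `live_and_historical` on every refresh and show the "live" figure only when `is_fresh` is true, while a monthly compliance report would use the "historical" figure from the very same samples.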
Putting It All Together
A well designed SLA can be a critical tool for managers and technicians. Hopefully the process of automating it in live SLA dashboards and historical reports will actually reduce the workload on administrators. It should dramatically reduce the time normally spent in discovery when service problems arise. The SLA will provide accountability, timely assistance, and a unified picture. It will enable service providers to report proactively how well they are providing service.