IT is a pervasive presence in businesses, running everything from enterprise-level scheduling applications to web-based stores. Monitoring the health of the infrastructure that underlies a critical application helps to ensure that it is available when needed, and that it functions as designed. Missing a form submission in a scheduling application due to an overloaded database server results in missed deliveries. Missing a swipe in a smartphone application due to network congestion results in a lost sale. IT applications are the lifeblood of organizations, and when the delivery of the information they produce is compromised, the consequences can include losses in productivity, revenue, and business intelligence. Effective server monitoring is the key to keeping your IT applications available.
One of the biggest challenges facing IT professionals is managing increasingly sophisticated and heterogeneous IT infrastructures. In the past, IT consisted of physical datacenters and servers, usually running a common OS. Management consisted of watching well-known key performance indicators (KPIs) and logs.
Fast forward to today. Applications can be dispersed across servers that are themselves dispersed across multiple locations. Servers can be virtualized either on local hosts or in the cloud. Each component has its own set of KPIs and logs, and IT professionals are tasked to do more with less. The IT profession has morphed from specialists handling a finite environment to generalists responsible for implementing and managing technology almost as fast as it is developed.
Server Monitoring Best Practice 1: Monitor KPIs and Availability
Knowing where to start monitoring can be challenging. The first line of defense for monitoring servers is watching for performance and availability issues that have an immediate effect. Critical early warning KPIs are CPU, Memory, Disk, and Network, but their importance and impact will vary depending on the platform. For example, high CPU on a physical server is addressed by looking for processes using more CPU than expected. High CPU on a virtual machine (VM) running on a virtualization host with overallocated CPU means not only looking for the process consuming the CPU on the VM, but also examining actual CPU utilization on the host to check whether it is approaching the limit of available CPU and causing CPU Ready % to increase.
|Figure 1. Longitude KPI Overview Dashboard|
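As a rough illustration of the host-side check described above, the sketch below converts a CPU ready value into a percentage. It assumes a VMware-style metric, where CPU ready is reported as a summation in milliseconds over a sampling interval; the function name and the 20-second default interval are assumptions for this example, not something taken from Longitude.

```python
def cpu_ready_percent(ready_ms: float, interval_seconds: float = 20.0) -> float:
    """Convert a CPU ready summation (milliseconds) into a percentage.

    Assumption: ready_ms is the time a VM's vCPU spent waiting to be
    scheduled during one sampling interval of interval_seconds.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Share of the interval spent waiting for a physical CPU.
    return ready_ms / (interval_seconds * 1000.0) * 100.0
```

A VM that waited 1,000 ms during a 20-second interval spent about 5% of that interval ready to run but unscheduled, which points at contention on the host rather than at a process inside the VM.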
Keep the following in mind when monitoring KPIs:
- Scheduled collection intervals should be frequent enough to pick up trends and minimize the effect of transitory spikes. Quickly changing metrics (CPU, Network Activity) should be sampled more frequently than metrics that change relatively slowly (memory, disk free space).
- Start with default KPI thresholds and adjust on a per-server basis. For example, database servers can be set to allocate as much free memory as is available on a server, resulting in low free memory that does not indicate a problem. Archiving KPI values allows you to create a baseline to adjust thresholds appropriately.
- Use an overview dashboard that groups like servers together and allows you to drill down when a problem occurs for that group of servers.
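One possible way to turn archived KPI values into a per-server threshold is sketched below. The rule used here (mean plus a multiple of the standard deviation) is an assumption chosen for illustration, as are the function name, the default threshold of 90, and the minimum sample count; it is meant for a metric where high values are bad, such as memory used %.

```python
from statistics import mean, pstdev


def baseline_threshold(history, k=3.0, default=90.0, min_samples=30):
    """Suggest an alert threshold for one server from archived KPI values.

    Falls back to the default threshold until enough history exists,
    then uses mean + k * standard deviation of the archived samples.
    """
    if len(history) < min_samples:
        return default
    return mean(history) + k * pstdev(history)
```

For a database server whose memory use normally sits between 92% and 94%, a default threshold of 90% would alert constantly, while the baseline-derived value only flags readings well above that server's normal range.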
Monitoring server availability is also an integral part of server monitoring. The primary purpose of availability testing is to ensure that the services provided by the server are accessible. Availability tests include:
- Request/response queries to the service provided by the server, e.g. accessing a web page or querying a database.
- Verifying that services/daemons and/or processes are running.
- Evaluating the output of diagnostic scripts.
- Verifying that application ports are listening for requests.
- Pinging the server. Please note that pinging a device is the standard for up/down tests, but it runs the risk of missing problems on servers that respond to a ping request while their OS or services are in a hung state.
|Figure 2. Longitude availability transactions.|
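Two of the availability tests listed above, the port check and the request/response query, can be sketched with the Python standard library alone. The function names and timeout defaults are assumptions for this example; a real deployment would add authentication, content checks, and retries as needed.

```python
import socket
import urllib.request


def port_is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def http_responds(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # urllib raises URLError/HTTPError (both OSError subclasses)
        # for unreachable servers and for 4xx/5xx responses.
        return False
```

Unlike a ping, both checks exercise the service itself, so a server that still answers ICMP while its web server is hung would be reported as unavailable.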
Server Monitoring Best Practice 2: Automate discovery of server infrastructure
IT infrastructures encompass a number of components, among them a mix of physical servers, VMs, virtualization hosts and network devices, each of which has unique KPIs and availability metrics. VMs are especially volatile, as they can be spun up or down with little advance notice. The goals for automating infrastructure monitoring are:
- Discover servers, VMs and network devices.
- Monitor discovered devices with appropriate KPIs.
- Monitor basic server availability with a ping.
- Monitor new VMs as they are created.
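The goals above amount to reconciling what discovery finds with what is already monitored. The sketch below shows that reconciliation step only; the device types, the KPI names in `KPI_PROFILES`, and the `reconcile` function are hypothetical and stand in for whatever a discovery tool actually returns.

```python
# Hypothetical KPI profiles per device type (names are illustrative).
KPI_PROFILES = {
    "physical": ["cpu", "memory", "disk", "network"],
    "vm": ["cpu", "memory", "disk", "network", "cpu_ready"],
    "host": ["cpu", "memory", "datastore"],
    "network": ["interface_utilization"],
}


def reconcile(discovered, monitored):
    """Compare a discovery result with the currently monitored set.

    discovered: dict mapping device name -> device type
    monitored:  set of device names already being monitored
    Returns (to_add, to_retire): new devices with the checks to apply,
    and monitored devices that discovery no longer sees.
    """
    to_add = {}
    for name, device_type in discovered.items():
        if name not in monitored:
            # Every new device gets a ping plus its type-specific KPIs.
            to_add[name] = ["ping"] + KPI_PROFILES.get(device_type, [])
    to_retire = monitored - set(discovered)
    return to_add, to_retire
```

Running this after each discovery pass picks up newly created VMs automatically and flags decommissioned ones, so the monitored set does not drift from the real environment.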
Server Monitoring Best Practice 3: Less is more for effective alerts
We’ve explored what needs to be monitored. Now let us address what happens once a problem is detected. For alerts, the best practice is “less is more”:
- Alert on critical problems with email or text messages (e.g. web site is down, or database is not responding).
- Enable escalation on less severe persistent problems (e.g. long running SQL queries, or low disk space warnings).
- If a problem can be resolved via automated intervention (e.g. a script), try that first and then escalate if the problem still exists.
- Smooth out transient spikes in volatile KPIs before alerting on a problem. For example, average 3 CPU collections at a 5-minute interval and evaluate the average against the KPI threshold.
|Figure 3. Longitude automated problem correction.|
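The last two alerting practices above can be sketched in a few lines: average the most recent collections before comparing against the threshold, and attempt an automated fix before escalating. The class and function names are assumptions for this example, and the callbacks stand in for whatever check, script, and notification mechanism is actually in use.

```python
from collections import deque


class SmoothedThreshold:
    """Evaluate the average of the last `window` samples against a threshold."""

    def __init__(self, threshold: float, window: int = 3):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def add(self, value: float) -> bool:
        """Record one collection; return True if the average breaches."""
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough collections yet to judge
        return sum(self.samples) / len(self.samples) > self.threshold


def handle_problem(still_failing, remediate, escalate) -> str:
    """Try the automated fix first; escalate only if the problem persists."""
    remediate()
    if still_failing():
        escalate()
        return "escalated"
    return "resolved"
```

With a threshold of 90 and a window of 3, the CPU readings 50, 99, 60, 95, 97, 96 only trigger on the final reading, when the three-sample average reaches 96. The isolated 99 never raises an alert.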
Server Monitoring Best Practice 4: Maintain Historical Context
As the saying goes, “Those who fail to learn from history are doomed to repeat it”. When approaching server monitoring, maintaining a historical context is important not only for capacity planning, but also for recognizing problem patterns related to availability and performance. Does the problem recur? How often? When? And under what circumstances? Knowing the answers to these questions is critical to successfully understanding and mitigating outages.
|Figure 4. Longitude problem events vs. time of day|
Figure 4 displays a problem event summary report for the average event volume generated over the previous 30 days for a monitored vCenter console. The x-axis represents the hour of the day, and the y-axis represents the average event volume per hour. The persistent spikes during the 3:00 AM and 3:00 PM hours indicate a problem with a 12-hour recurrence that might have been missed when viewing events over a shorter timescale.
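The kind of summary behind such a report can be approximated by bucketing archived event timestamps by hour of day. The sketch below is an illustration only: the function names, the use of the median as the "typical" hourly volume, and the spike factor of 3 are all assumptions, not a description of how Longitude builds its reports.

```python
from collections import Counter
from statistics import median


def average_events_per_hour(timestamps, days: int):
    """Return 24 values: average event count per day for each hour of day."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) / days for hour in range(24)]


def recurring_spike_hours(hourly_avg, factor: float = 3.0, min_avg: float = 1.0):
    """List hours whose average volume is well above the typical hour."""
    typical = median(hourly_avg)
    return [
        hour
        for hour, value in enumerate(hourly_avg)
        if value >= min_avg and value > factor * typical
    ]
```

Given 30 days of `datetime` event timestamps, hours that stand out every day (for example 3 and 15) show up in the result, and the gap between them hints at the recurrence period worth investigating.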
|Figure 5. Longitude historical problem event detail.|
The next step in investigating this pattern is to examine the details for the problems reported during the spikes in Figure 4. The report in Figure 5 shows a regular pattern related to excessive CPU usage on one VM. The investigation would then continue with the problem reports and archived KPI values for that VM.
Operating an IT infrastructure that encompasses both physical and virtual resources complicates IT’s task of ensuring maximum availability for business-critical applications. A cohesive server monitoring strategy is necessary to avoid outages that affect productivity and the bottom line.
Server monitoring should:
- Monitor appropriate KPIs in the context of server type and function
- Monitor services provided by servers to identify issues before they impact the organization
- Discover/automate IT monitoring wherever possible
- Limit alerting to high severity and persistent issues and automate responses when possible
- Maintain history and understand problem patterns