How to Monitor Server Performance

May 25, 2017 | Ken Leoni

Server performance monitoring can be something of an art form, especially as the server infrastructure and the surrounding network become increasingly dispersed and complex. Determining what counts as “problematic” is increasingly difficult.

A successful server monitoring strategy has three key components: identify the key metrics to target, baseline those metrics so that server performance is properly interpreted for alerting, and reap additional value from the metrics via reporting.

What metrics should be targeted:

It is important to target key performance indicators (KPIs) that are specific to each server’s function.

These key performance metrics serve as a good starting point for any Windows or Unix server monitoring strategy.

Windows Server Performance Metrics

CPU

  • Process Count: The number of processes in the computer.
  • Thread Count: The number of threads in the computer.
  • % Interrupt Time: Percentage of time the processor spends receiving and servicing hardware interrupts.
  • % Privileged Time: Percentage of time that the process threads spent executing code in privileged mode.
  • % Processor Time: Percentage of time that the processor spends executing a non-idle thread.
  • % User Time: Percentage of time the processor spends in user mode.

Disk

  • Disk Free %: Percentage of disk space that is free.
  • Disk Free Space: Amount of free disk space.
  • Disk Reads/sec: The rate of read operations on the disk.
  • Disk Writes/sec: The rate of write operations on the disk.
  • Disk Read Bytes/sec: The rate at which bytes are transferred from the disk during read operations.
  • Disk Write Bytes/sec: The rate at which bytes are transferred to the disk during write operations.
  • Disk Transfers/sec: The rate of read and write operations on the disk.

Memory

  • Free Memory: Amount of free memory.
  • Page Faults/sec: The average number of pages faulted per second. This counter includes both hard faults (those that require disk access) and soft faults (where the faulted page is found elsewhere in physical memory).
  • Page Reads/sec: The rate at which the disk was read to resolve hard page faults.
  • Page Writes/sec: The rate of write operations to disk performed to free up space in physical memory.
  • Pages Output/sec: The rate at which pages are written to disk to free up space in physical memory.
  • Pool Nonpaged Bytes: The size, in bytes, of the nonpaged pool, an area of system memory (physical memory used by the operating system) for objects that cannot be written to disk, but must remain in physical memory as long as they are allocated.
  • Pool Paged Bytes: The size, in bytes, of the paged pool, an area of system memory (physical memory used by the operating system) for objects that can be written to disk when they are not being used.

Network

  • Output Queue Length: The length of the output packet queue.
  • Packets Outbound Errors: The number of outbound packets that could not be transmitted because of errors.
  • Packets Received Errors: The number of inbound packets that contained errors preventing them from being deliverable to a higher-layer protocol.
  • Kilobytes Received/sec: The rate at which kilobytes are received on the network interface.
  • Kilobytes Sent/sec: The rate at which kilobytes are sent on the network interface.

 

Unix Server Performance Metrics

CPU

  • Process Count: The number of processes in the computer.
  • Thread Count: The number of threads in the computer.
  • % Interrupt Time: Percentage of time the processor spends receiving and servicing hardware interrupts.
  • % Privileged Time: Percentage of time that the process threads spent executing code in privileged mode.
  • % Processor Time: Percentage of time that the processor spends executing a non-idle thread.
  • % User Time: Percentage of time the processor spends in user mode.

Disk

  • Disk Free %: Percentage of disk space that is free.
  • Disk Free Space: Amount of free disk space.
  • Disk Reads/sec: The rate of read operations on the disk.
  • Disk Writes/sec: The rate of write operations on the disk.
  • Disk Read Bytes/sec: The rate at which bytes are transferred from the disk during read operations.
  • Disk Write Bytes/sec: The rate at which bytes are transferred to the disk during write operations.
  • Disk Transfers/sec: The rate of read and write operations on the disk.

Memory

  • Total Physical Memory: The total size of physical memory.
  • Free Memory: Amount of free memory.
  • Page Faults/sec: The average number of pages faulted per second. This counter includes both hard faults (those that require disk access) and soft faults (where the faulted page is found elsewhere in physical memory).
  • Page Reads/sec: The rate at which the disk was read to resolve hard page faults.
  • Page Writes/sec: The rate of write operations to disk performed to free up space in physical memory.
  • Pages Output/sec: The rate at which pages are written to disk to free up space in physical memory.
  • Pool Nonpaged Bytes: The size, in bytes, of the nonpaged pool, an area of system memory (physical memory used by the operating system) for objects that cannot be written to disk, but must remain in physical memory as long as they are allocated.
  • Pool Paged Bytes: The size, in bytes, of the paged pool, an area of system memory (physical memory used by the operating system) for objects that can be written to disk when they are not being used.

Network

  • Output Queue Length: The length of the output packet queue.
  • Packets Outbound Errors: The number of outbound packets that could not be transmitted because of errors.
  • Packets Received Errors: The number of inbound packets that contained errors preventing them from being deliverable to a higher-layer protocol.
  • Kilobytes Received/sec: The rate at which kilobytes are received on the network interface.
  • Kilobytes Sent/sec: The rate at which kilobytes are sent on the network interface.
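Several of the metrics above are either ratios (Disk Free %) or per-second rates. When a data source only exposes cumulative counts, a rate has to be derived from two samples. Here is a minimal sketch in Python using only the standard library. The function names are our own for illustration, and the rate helper assumes the counter did not reset between samples.

```python
import shutil


def disk_free_percent(path="/"):
    """Return free space on the volume holding `path` as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.free / usage.total


def rate_per_second(previous_count, current_count, interval_seconds):
    """Turn two readings of a cumulative counter into a per-second rate.

    Hypothetical helper: assumes the counter only increased between
    the two samples (no reset or wraparound).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (current_count - previous_count) / interval_seconds


if __name__ == "__main__":
    print(f"Disk Free %: {disk_free_percent('/'):.1f}")
    # Illustrative numbers: 1,500 reads completed during a 60-second interval.
    print(f"Disk Reads/sec: {rate_per_second(10_000, 11_500, 60):.1f}")
```

The sampling interval matters: a longer interval smooths out short bursts, while a shorter one produces a noisier but more responsive value.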

 

Determining a proper monitoring baseline:

Once the appropriate KPIs are identified, the next step is to determine the proper alerting criteria. While some problems can be readily detected by alerting on a static value (e.g. disk space issues), other problems are more challenging because of the volatility of the KPIs, especially KPIs that change based on regular workload patterns.

Baselining is the determination of “usual” behavior and triggering problem notification based on deviations from normal. The guiding principle is to alert on performance and availability problems that are outside the norm.
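One simple way to express "deviation from normal" is to compare a new reading against the mean and standard deviation of historical samples. The sketch below is only one possible approach, not a description of how any particular product does it. The sample values and the tolerance of three standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize historical samples as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def is_outside_norm(value, baseline, tolerance=3.0):
    """True when `value` is more than `tolerance` standard deviations from the mean."""
    average, spread = baseline
    return abs(value - average) > tolerance * spread


history = [38, 41, 40, 43, 39, 42, 40, 41]  # illustrative CPU % samples
baseline = build_baseline(history)

print(is_outside_norm(44, baseline))  # slightly above the usual range
print(is_outside_norm(95, baseline))  # far outside the usual range
```

A static threshold would treat both readings the same way if it were set at, say, 40%. The baseline comparison only flags the reading that is unusual for this particular server.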

Determining what is “usual” means understanding the server’s role and when and how it is being utilized. For example, is the server performing a Monday-Friday 9-5 function (e.g. a file/print server or terminal server)? Is it a database server that provides services for after-hours batch processing? The baseline characteristics will vary depending on how and when the server is most utilized. Some servers may experience “usual” changes in workload based on time of day, day of week, or week of month, and the baseline needs to account for that volatility.
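To account for time-of-day and day-of-week patterns, samples can be grouped into time slots so that each slot gets its own baseline. This is a minimal sketch under our own assumptions: the slot granularity (weekday plus hour) and the sample values are illustrative, and week-of-month patterns would need an additional key.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def baseline_by_hour_of_week(samples):
    """Group (timestamp, value) samples by (weekday, hour) and average each group."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # weekday(): Monday is 0, Sunday is 6
        buckets[(timestamp.weekday(), timestamp.hour)].append(value)
    return {slot: mean(values) for slot, values in buckets.items()}


samples = [
    (datetime(2017, 5, 15, 10, 0), 70.0),  # Monday 10:00
    (datetime(2017, 5, 22, 10, 0), 74.0),  # Monday 10:00, one week later
    (datetime(2017, 5, 15, 23, 0), 12.0),  # Monday 23:00
]
slot_baseline = baseline_by_hour_of_week(samples)
print(slot_baseline[(0, 10)])  # typical value for Monday mornings
print(slot_baseline[(0, 23)])  # typical value for Monday late evenings
```

With per-slot baselines, a CPU reading of 70% on a Monday morning is unremarkable, while the same reading late on a Monday evening stands out.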

There are a number of approaches to determining a baseline. One approach is to collect the relevant data, either via the tools provided by the operating system or via an application, and then analyze the data to determine appropriate baselines. Taking the time to evaluate your data and learn how your servers are performing (i.e. when and where they experience changes in workload) is an important exercise, as you may well discover previously unobserved behaviors. “Normal” patterns aren’t necessarily desirable: for example, unwanted batch processing in the middle of the day may in fact be normal, but it is certainly not desired. There is immense value in having a firsthand understanding of the performance patterns of critical hosts, servers and applications.

When determining a baseline, it is critical that the volume of data gathered is enough to be statistically valid. The collection timeframe should span multiple occurrences of any changes or patterns in workload. A time sample that is too short can miss a pattern entirely.
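A quick sanity check on collection coverage is to count how many distinct days contributed data to each time slot. The sketch below flags slots that have been observed too few times to trust. The minimum of four occurrences is an arbitrary assumption for illustration, and the helper name is our own.

```python
from collections import defaultdict
from datetime import datetime


def undersampled_slots(timestamps, min_occurrences=4):
    """Return (weekday, hour) slots seen on fewer than `min_occurrences` distinct dates."""
    dates_per_slot = defaultdict(set)
    for ts in timestamps:
        dates_per_slot[(ts.weekday(), ts.hour)].add(ts.date())
    return sorted(
        slot for slot, dates in dates_per_slot.items()
        if len(dates) < min_occurrences
    )


timestamps = [
    datetime(2017, 5, 1, 10, 0),   # four consecutive Mondays at 10:00
    datetime(2017, 5, 8, 10, 0),
    datetime(2017, 5, 15, 10, 0),
    datetime(2017, 5, 22, 10, 0),
    datetime(2017, 5, 15, 23, 0),  # a single Monday at 23:00
]
print(undersampled_slots(timestamps))
```

Note that this only reports slots with at least one sample. Slots with no data at all, such as hours outside the collection schedule, would need a separate check.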

Also, do not limit the collection of key performance indicators to business hours only, as off-prime resources still require careful scrutiny. For example, if you have nightly processing, you’ll want to look for any changes in its behavior.

Here we see a report from Longitude that is helping baseline disk performance on an Exchange Server. We can see a regular pattern (10:04 PM nightly) of a disk having a high queue depth (the number of queued read/write operations waiting for disk access). We’ll want to recognize this as normal and NOT a problem, as this is when the backup is running. The goal is to avoid alerts on known, normal resource issues.

 Reporting that helps baseline server performance metrics
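Once a recurring pattern like the nightly backup is recognized, one way to avoid alerting on it is to suppress the alert inside an expected daily window. The following is a rough sketch only. The window boundaries (10:00 PM to 11:30 PM) and the queue depth threshold are assumptions for illustration, not values taken from the report.

```python
from datetime import datetime, time


def in_window(moment, start, end):
    """True when `moment` falls inside a daily window, including windows that cross midnight."""
    now = moment.time()
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_alert(moment, queue_depth, threshold, quiet_start, quiet_end):
    """Alert on a high disk queue depth unless it occurs in the expected window."""
    if queue_depth <= threshold:
        return False
    return not in_window(moment, quiet_start, quiet_end)


# Assumed values: backup window 22:00-23:30, alert when queue depth exceeds 2.
backup_start, backup_end = time(22, 0), time(23, 30)
print(should_alert(datetime(2017, 5, 24, 22, 4), 8, 2, backup_start, backup_end))
print(should_alert(datetime(2017, 5, 24, 14, 0), 8, 2, backup_start, backup_end))
```

A refinement worth considering is a separate, higher threshold inside the window rather than full suppression, so that a backup that behaves far worse than usual still raises an alert.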

A second approach to establishing baseline values for KPIs is to leverage technology that analyzes the data and helps you determine appropriate values for alerting.

Here we see Longitude providing guidance for CPU usage by calculating the Minimum, Maximum, and Average for a number of servers.

Automated Baseline Calculation of Server Performance Metrics
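The same three summary values are easy to compute yourself from collected samples. This is a minimal sketch of the general idea, not a description of how Longitude performs its calculation. The server names and CPU values are made up for illustration.

```python
from statistics import mean


def summarize(cpu_by_server):
    """Return {server: (minimum, maximum, average)} for each server's CPU samples."""
    return {
        server: (min(values), max(values), mean(values))
        for server, values in cpu_by_server.items()
    }


# Hypothetical servers and CPU % samples.
cpu_samples = {
    "web01": [20, 35, 50],
    "db01": [60, 80, 70],
}
summary = summarize(cpu_samples)
for server, (low, high, average) in summary.items():
    print(f"{server}: min={low} max={high} avg={average}")
```

Comparing the average against the maximum gives a rough sense of how spiky a server is, which helps when choosing between a tight and a loose alerting threshold.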

Reporting:  Get value from your KPIs

Lastly, all the collected data should be leveraged to evaluate server performance, observe trends, diagnose bottlenecks, and determine whether the current configuration is performing as expected.

Basic Reporting:

Reporting is an integral part of any server monitoring strategy. It is always a good idea to regularly review the performance of your environment, even if you don’t believe there are any substantive changes in workload. It isn’t unusual to experience undesirable behavior after a software upgrade or patch. If you're not diligent about watching for changes in server behavior, your IT performance could be compromised.

Having ready access to even the most basic of reports is extremely valuable in helping IT resolve server performance issues:

  • Identify problems related to resource usage
  • Show IT infrastructure and application availability
  • Reveal issues that require attention

Capacity Management:

Evaluating capacity is a continuous process, as IT implementations are in a constant state of flux. IT performance is a moving target because of variations in application activity and the corresponding effects on the IT infrastructure. Capacity planning and management helps:

  • Determine the resources needed to support a projected workload
  • See how changes in hardware will affect application performance
  • Baseline application and IT infrastructure performance
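As a small illustration of projecting resource needs, the sketch below fits a straight line through daily disk usage and estimates how many days remain until the disk is full. It assumes growth is roughly linear, which is often not the case in practice, and the numbers are invented for the example.

```python
def days_until_full(daily_used_gb, capacity_gb):
    """Estimate days until a disk fills, using a least-squares line through daily usage.

    Returns None when usage is flat or shrinking.
    """
    n = len(daily_used_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    days = range(n)
    mean_x = sum(days) / n
    mean_y = sum(daily_used_gb) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(days, daily_used_gb))
        / sum((x - mean_x) ** 2 for x in days)
    )
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    fitted_latest = intercept + slope * (n - 1)  # trend line value at the latest sample
    return (capacity_gb - fitted_latest) / slope


# Illustrative data: usage grows by 2 GB per day on a 128 GB volume.
print(days_until_full([100, 102, 104, 106, 108], capacity_gb=128))
```

The projection is measured from the trend line rather than the last raw sample, so a single unusually high or low day does not swing the estimate as much.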

 

 

Compliance:

IT departments often have to commit to a specific level of performance and availability. Using KPI data to document compliance with a promised service level helps answer questions like:

  • What percentage of the time are services available?
  • How are the services performing?
  • What is the root cause of outages and degradations in performance?
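The first of these questions reduces to simple arithmetic once availability checks are being recorded. Here is a minimal sketch, assuming each check is stored as a pass/fail value taken at a regular interval. The 99.5% target is an example figure, not a recommendation.

```python
def availability_percent(checks):
    """Percentage of health checks that succeeded; `checks` is a sequence of booleans."""
    if not checks:
        raise ValueError("no checks recorded")
    return 100.0 * sum(checks) / len(checks)


def meets_service_level(checks, target_percent):
    """True when measured availability is at or above the promised level."""
    return availability_percent(checks) >= target_percent


# Illustrative month: 1,000 checks, 3 of which failed.
checks = [True] * 997 + [False] * 3
print(f"{availability_percent(checks):.1f}% available")
print(meets_service_level(checks, target_percent=99.5))
```

Because the result depends on how often checks run, the check interval should be stated alongside any availability figure that is reported.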

Conclusion:

If users are flooded with too many alerts, it will invariably come to a point where alerts are ignored or, even worse, missed altogether. Eventually the server monitoring strategy will degrade into a “Boy Who Cried Wolf” situation.

Targeted alerting means setting thresholds so that only actionable alarms are triggered. Alerts must be meaningful within the context of the particular system or application. For example, a good practice is to work from a baseline and have the minimum, maximum, and average values readily at hand when assessing alerting criteria for performance-based thresholds.

In addition, the ability to produce and interpret reports quickly helps show that IT is proactive and understands the issues that can potentially impact the business. Reports also show the positive impact IT has on an organization's productivity, on-time delivery, and quality by helping:

  • Provide visibility into server and application performance
  • Optimize performance and reduce downtime
  • Show the value of IT
  • Identify areas where IT can make improvements
  • Document compliance and service delivery

Want to learn more?

Download our Best Practices for Server Monitoring Whitepaper and learn how to achieve a successful long-term server monitoring strategy by focusing on an approach that is lightweight, efficient, resilient, and automated.

 
