As an IT infrastructure grows in size and complexity, implementing infrastructure monitoring becomes increasingly difficult. At scale it is hard to drill down from a distributed application performance issue to a resource constraint on one of the application's back-end servers. A mix of operating systems, network equipment, and software versions adds further complexity: each requires its own collection methods, and metrics must be compared across servers, applications, and the network.
Fortunately, regardless of network size or complexity, the same 3 steps can be used to implement infrastructure monitoring:
Monitor the operating systems
Monitoring the operating systems of all your servers provides a snapshot of resource availability. Alerts for low free memory, a high disk queue, or other constrained resources let you address bottlenecks proactively, before they affect application performance. Grouping metrics from the multiple back-end servers behind a common front-end application also helps you track application performance issues caused by bottlenecks distributed across several servers.
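As a minimal sketch of this kind of resource alerting, the following checks free disk space against a threshold using only Python's standard library. The 10% threshold and the paths checked are illustrative assumptions; a real deployment would cover memory, disk queue, and other metrics as well.

```python
import shutil

def low_disk_alerts(paths, min_free_pct=10.0):
    """Return alert messages for any path whose free space falls below the threshold."""
    alerts = []
    for path in paths:
        usage = shutil.disk_usage(path)
        free_pct = usage.free / usage.total * 100
        if free_pct < min_free_pct:
            alerts.append(f"{path}: only {free_pct:.1f}% free")
    return alerts

# Check the root filesystem (on Windows, use a drive letter such as "C:\\").
print(low_disk_alerts(["/"]))
```

The same pattern generalizes: collect a metric, compare it to a threshold, and emit an alert only when the threshold is crossed.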
Data collected for real-time performance alerts can also be archived and used for capacity planning. Capacity planning typically extrapolates when additional resources will be needed based on long-term usage trends. However, archived data can also identify underutilized servers, which is especially important when allocating resources to virtual machines (VMs) running on Hyper-V or VMware.
Monitor connectivity
The most basic tool for monitoring connectivity is a ping. A ping failure can provide a quick alert for an offline device, or for network congestion severe enough to hang connections between devices. However, pings are prone to intermittent failure due to transient network issues, resulting in false alerts. To counter this, trigger email or pager warnings only after multiple consecutive failures.
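The consecutive-failure rule can be sketched as a small tracker. How each check is performed (for example, shelling out to the system `ping` command) is left out here; the sketch shows only the alert-suppression logic, with a threshold of three as an illustrative choice.

```python
class ConsecutiveFailureAlert:
    """Fire an alert only after `threshold` consecutive failed checks,
    so a single dropped ping does not page anyone."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def record(self, success):
        """Record one check result; return True if an alert should fire."""
        self.failures = 0 if success else self.failures + 1
        return self.failures >= self.threshold

monitor = ConsecutiveFailureAlert(threshold=3)
results = [True, False, False, True, False, False, False]
alerts = [monitor.record(ok) for ok in results]
print(alerts)  # only the last check, the third consecutive failure, alerts
```

Note how the isolated failures early in the sequence are absorbed silently; only a sustained outage crosses the threshold.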
A ping is useful but has its limitations: it will tell you whether a server is reachable on the network, but not whether a service is available on that server. For example, a SQL server will respond to a ping even if the database is not running. To monitor your ability to connect to resources across servers, check that they are listening on their configured ports and, ideally, that they are responding to requests. For a SQL server, run a test query and check that the expected results are returned. For a DNS server, check that a name resolves correctly. Map out the required connections between your servers and set up test transactions to verify that those resources are accessible.
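The two building blocks of these service checks, a TCP port probe and a DNS lookup, can be sketched with the standard socket module. A full test transaction (an actual SQL query via a client library, for instance) goes beyond this; the sketch demonstrates the port check against a throwaway local listener rather than a real server.

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def dns_resolves(name):
    """Return the resolved IPv4 address for name, or None if resolution fails."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        return None

# Demonstrate port_open against a throwaway local listener.
server = socket.socket()
server.bind(("127.0.0.1", 0))        # OS picks a free port
server.listen(1)
port = server.getsockname()[1]
listening = port_open("127.0.0.1", port)
server.close()
print(listening, dns_resolves("localhost"))
```

A port being open proves a process is listening; it still does not prove the service behind it is healthy, which is why a test transaction remains the stronger check.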
Monitor system logs
System logs chronicle the activity on servers and network devices and can record everything from a successful logon to a bad block on a physical drive. Log entries range in severity from low-level Information or Audit Success messages to higher-severity Warning, Error, Critical, and Audit Failure messages. Unless you need to monitor a specific Information or Audit Success event, skip the low-severity events that make up the bulk of log data and focus on the less frequent but more useful messages at Warning severity or higher.
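A severity filter of this kind can be sketched as follows. The numeric ranks assigned to the Windows Event Log entry types are an illustrative assumption; the point is simply that everything below Warning is dropped before alerting or storage.

```python
# Illustrative severity ranks for Windows Event Log entry types.
SEVERITY = {
    "Information": 0,
    "Audit Success": 0,
    "Warning": 1,
    "Audit Failure": 2,
    "Error": 2,
    "Critical": 3,
}

def noteworthy(entries, min_rank=1):
    """Keep only entries at Warning severity or higher."""
    return [e for e in entries if SEVERITY.get(e["level"], 0) >= min_rank]

events = [
    {"level": "Information", "msg": "Service started"},
    {"level": "Warning", "msg": "Disk queue length high"},
    {"level": "Error", "msg": "Backup job failed"},
]
print(noteworthy(events))
```

Filtering at collection time keeps the monitoring pipeline focused on the small fraction of events that actually warrant attention.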
Collecting log data provides both an early warning of a problem and a forensic record of what happened on a server or network device before a failure. Unix and Linux systems, and some network devices, use Syslog to manage their logs; the Syslog daemon can be configured to forward records of specified severities to remote listeners. The Windows equivalent is the Windows Event Log, which can be collected remotely through WMI.
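Forwarding by severity works because every Syslog message carries a PRI header that encodes both facility and severity (PRI = facility × 8 + severity, per RFC 3164). A minimal parser for that header:

```python
import re

def parse_pri(message):
    """Extract the syslog PRI value and split it into (facility, severity).
    Severity 0 (Emergency) is the most severe, 7 (Debug) the least."""
    match = re.match(r"<(\d{1,3})>", message)
    if not match:
        return None
    pri = int(match.group(1))
    return pri >> 3, pri & 7  # facility, severity

# "<34>" decodes to facility 4 (security/auth) and severity 2 (Critical).
print(parse_pri("<34>Oct 11 22:14:15 host su: 'su root' failed"))
```

A remote listener can apply exactly this decoding to keep only Warning-or-worse records, mirroring the Event Log filtering described above.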
These 3 simple monitoring steps provide an overview of your infrastructure's health, along with the tools to map application performance issues to underlying resource constraints. The next step after monitoring your infrastructure is application monitoring.