The technology stack continues to increase in both breadth and complexity, to the point where automated server monitoring and alerting is the only way IT organizations can effectively function.
Server monitoring and alerting, at its most basic level, advances two concepts:
A knowledge base is in place that understands what is happening within the server infrastructure.
The IT organization is able to leverage that knowledge base effectively through proactive alerts.
Common Obstacles to monitoring and alerting:
No news is good news
“Everything is running fine and users aren’t complaining, so I don’t need anything.”
As a short-term strategy, “no news is good news” may well be viable. As a long-term strategy, however, it is entirely impractical.
- There may well be latent performance problems that do not immediately affect users but cost the organization time and money. For example, an application could be consuming an inordinate or unnecessary amount of server resources because of poorly written code, a patch, or an upgrade. Even if resources are available, that doesn’t give IT carte blanche to use them wastefully. IT’s computing and financial resources are almost always bounded and must be allocated as efficiently as possible.
- Additional capacity that has been bought and paid for may not be fully deployed or may be misallocated. By not understanding how existing resources are being used, IT may unknowingly be increasing operating expenses.
Alerting is too much work, reporting is all I need
Collecting and reporting server metrics can be perceived as an easier path because key performance indicators are natively available in server operating systems (e.g., Windows and Linux) and can be exported to a database for further analysis. Reporting alone, however, only describes a problem after the fact.
- Few if any IT organizations can claim 100% uptime. It is not a question of “if” but of “when” an outage will occur.
- Ultimately, a determination has to be made as to what the cost of a failure is and whether the organization can afford not to engage in a server monitoring and alerting strategy.
Too big, too complicated, too much time, and shelfware
Server monitoring and alerting need not be a burdensome or resource-intensive process. Whether you script a solution on your own or use third-party technology, a targeted approach is critical to a successful implementation. Every decision should be made with the goal of making the most efficient use of limited IT resources:
- Focus only on capabilities that add value and are the most impactful to the organization
- Design and implement with the goal of limiting the ongoing personnel time required to deploy, administer, and maintain
- Ensure that the underlying monitoring and alerting engine uses minimal computing resources
Core principles of effective Server Monitoring:
Know what information to gather
Targeting lowest-common-denominator metrics related to CPU, memory, disk, and network is fundamental to any server monitoring strategy. However, it is also important to consider what the servers’ functions are and to determine whether the servers are delivering their assigned capabilities. For example, are file shares available? Are printers? It is as important to monitor for the “end product” as it is to monitor for resource and performance issues.
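As a rough illustration of pairing a resource metric with an “end product” check, here is a minimal Python sketch that uses only the standard library. The function names are hypothetical, and a TCP connection test is only one assumed way of deciding that a service such as a file share is available.

```python
import shutil
import socket


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of disk space in use on the volume holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def service_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, service_reachable("fileserver.example.com", 445) would test whether a (hypothetical) file server accepts connections on the standard SMB port. A successful connection only shows that the port is open, so a fuller check might also try to list or read from the share.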
Know how often to collect server metrics
Not all server metrics are created equal. Consider collecting what you need only when you need it, keep it “lightweight”, and pay close attention to the worth and volatility of each metric. Collecting availability metrics frequently allows for the most accurate service level reporting as well as more immediate alerting. Less critical and less volatile server metrics need not be collected as often, which keeps the overhead on the servers and on the underlying monitoring technology to a minimum.
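One way to express per-metric collection frequency is a simple interval table. The sketch below is illustrative only: the metric names and the intervals are assumptions, not recommendations, and metrics_due is a hypothetical helper.

```python
from typing import Dict, List

# Hypothetical collection intervals in seconds: availability is checked often,
# while slow-moving metrics such as disk usage are checked rarely.
COLLECTION_INTERVALS: Dict[str, int] = {
    "availability": 30,
    "cpu_percent": 60,
    "memory_percent": 60,
    "disk_used_percent": 900,
}


def metrics_due(now: float, last_collected: Dict[str, float]) -> List[str]:
    """Return the metrics whose collection interval has elapsed.

    `now` and the values in `last_collected` are timestamps in seconds.
    A metric that has never been collected is always due.
    """
    due = []
    for name, interval in COLLECTION_INTERVALS.items():
        last = last_collected.get(name)
        if last is None or now - last >= interval:
            due.append(name)
    return due
```

A collector loop would call metrics_due on each tick and gather only the metrics it returns, so rarely needed metrics add almost no overhead.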
Know when to evaluate server metrics
How often metrics are collected and how often they are evaluated need not be the same. There is more value in collecting multiple data points for certain metrics (e.g., CPU usage) and evaluating whether there is a resource problem based on an aggregation of the data. The primary advantage of alerting on accumulated data is the avoidance of false positives. If, for example, there is a brief CPU spike and the alerting is based on the average of multiple CPU data points, a false alert will not be generated.
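The averaging idea can be sketched in a few lines of Python. The class name, the window size, and the threshold are all assumptions chosen for illustration. The point is only that evaluation happens on an aggregate of samples rather than on each sample.

```python
from collections import deque
from statistics import mean


class AveragedThreshold:
    """Flag a problem only when the average of the last `window` samples exceeds `threshold`."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        # A bounded deque discards the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def add(self, value: float) -> bool:
        """Record a sample; return True only if a full window averages above the threshold."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and mean(self.samples) > self.threshold
```

With a threshold of 90 and a window of 5, the samples 20, 25, 100, 22, 24 never trigger an alert because the single spike is averaged away, whereas five consecutive samples in the high 90s do.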
Core principles of effective Server Alerting:
Target the alerting
Alert only on critical issues and minimize the number of alerts. Informational alerts or alerts on “problems” that aren’t severe enough to warrant attention should be eliminated.
Where possible, correlate issues and work to collapse the alerts. A simple example: if a server fails, IT staff need only an alert that the server is unavailable, not alerts for each service or resource that is unavailable.
In many respects, too many alerts are worse than no alerts at all, because time, effort, and other valuable IT resources have been expended to configure alerts only to have them ignored!
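The server-failure example can be sketched as a small correlation step that runs before any alert is sent. The function name and the message format are hypothetical. The sketch assumes the monitoring system already knows which hosts are down and which services have failed on each host.

```python
from typing import Dict, List, Set


def collapse_alerts(down_hosts: Set[str], failed_services: Dict[str, List[str]]) -> List[str]:
    """Return one alert per unavailable host, and per-service alerts only for hosts that are up."""
    alerts = [f"{host}: server unavailable" for host in sorted(down_hosts)]
    for host in sorted(failed_services):
        if host in down_hosts:
            continue  # The host-level alert already covers these services.
        for service in failed_services[host]:
            alerts.append(f"{host}: {service} unavailable")
    return alerts
```

If web01 is down and both its http and ssh checks fail, staff receive a single “server unavailable” alert for web01, while a failed mysql check on a healthy db01 is still reported on its own.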
Escalate alerting based on problem persistence
Escalation means not only alerting different IT staff for persistent issues, but also having alternate notification mechanisms in place. For example, start with an alert to a dashboard, step up to email notifications, and then escalate to texting. Escalation also provides for a natural triaging, ensuring the proper IT resources address the most severe issues.
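The dashboard, email, and text progression might be captured as a small escalation ladder. The time thresholds below are assumptions for illustration only, and channels_for is a hypothetical helper.

```python
from typing import List, Tuple

# Hypothetical ladder: (minutes the problem has persisted, notification channel).
ESCALATION_LADDER: List[Tuple[int, str]] = [
    (0, "dashboard"),
    (15, "email"),
    (30, "text"),
]


def channels_for(minutes_persisted: int) -> List[str]:
    """Return every channel that should have been notified after this many minutes."""
    return [
        channel
        for threshold, channel in ESCALATION_LADDER
        if minutes_persisted >= threshold
    ]
```

The same table could carry a recipient or team per step, so that a persistent problem reaches different staff as well as different channels.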
Also, when automated corrective actions or scripts are deployed, make sure to have an escalation policy in place that is based on the success or failure of the automation. It may not be necessary to alert IT staff if automation fixes the problem. Conversely, if automation is unsuccessful, be timely about alerting IT staff so that they can intervene.
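A minimal sketch of escalation based on the outcome of automation follows. The `fix` and `alert` callables are placeholders for whatever corrective script and notification mechanism are actually in use, and the function name is hypothetical.

```python
from typing import Callable


def remediate_then_alert(
    problem: str,
    fix: Callable[[], bool],
    alert: Callable[[str], None],
) -> bool:
    """Run an automated fix; alert staff only if the fix fails or raises an error."""
    try:
        fixed = fix()
    except Exception as exc:
        alert(f"{problem}: automated fix raised {exc!r}; manual intervention needed")
        return False
    if not fixed:
        alert(f"{problem}: automated fix did not resolve the issue; manual intervention needed")
    return fixed
```

When the fix succeeds, no alert is sent. It may still be worth logging the event so that recurring problems remain visible in reporting.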
When possible do not rely on existing IT infrastructure for alerting
Make sure that the actual alerting mechanism (e.g., email or texting) is resilient to any failures within the internal IT infrastructure. For example, emailing to a work email address is ineffective if the company email server is down.
A strategy that employs an external email or external texting capability ensures that IT staff are alerted even when the IT infrastructure itself is compromised.
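One possible way to build in that resilience is to try delivery channels in order and fall through to the next when one fails. The sketch below assumes each channel is wrapped in a callable that raises an exception on failure. The channel names in the usage note are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Notifier = Callable[[str], None]


def send_with_fallback(message: str, notifiers: List[Tuple[str, Notifier]]) -> Optional[str]:
    """Try each notifier in order; return the name of the first that succeeds, or None."""
    for name, notify in notifiers:
        try:
            notify(message)
            return name
        except Exception:
            continue  # This channel is unavailable; try the next one.
    return None
```

Called with a list such as [("internal_email", ...), ("external_sms", ...)], the function still delivers the alert through the external channel when the internal mail server is down. A return value of None means no channel worked, which deserves its own handling.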
Make sure alerts communicate urgency and remediation
While many alerts can be standardized, critical issues require special attention. When things go horribly wrong, make sure the alert recipient understands the urgency, the ramifications, and what to do if they can’t address the issue.
Informative alerts along with built-in escalation also ensure that problems will be addressed with consistency no matter the IT personnel.
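A consistent alert template is one way to make sure urgency, ramifications, and next steps are always present. The field names and layout below are assumptions for illustration, not a standard format.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    summary: str           # What happened
    urgency: str           # For example "critical" or "warning"
    impact: str            # The ramifications if the issue is left unresolved
    remediation: str       # What the recipient should do first
    fallback_contact: str  # What to do or whom to call if they cannot address it

    def render(self) -> str:
        """Return the alert as a short, consistently ordered plain-text message."""
        return "\n".join([
            f"[{self.urgency.upper()}] {self.summary}",
            f"Impact: {self.impact}",
            f"What to do: {self.remediation}",
            f"If you cannot address this: {self.fallback_contact}",
        ])
```

Because every field is required, an alert cannot be created without stating its impact and remediation, which helps keep responses consistent regardless of who receives it.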
Alert multiple IT staff
Whether multiple IT staff are alerted on the initial occurrence of a problem or as the result of escalation, reaching out to more than one IT staff member ensures coverage of and accountability for the issue.
Perhaps server monitoring and alerting isn’t quite rocket science, but it will benefit from proper planning. A little upfront homework to determine what metrics to target and how often to go after them will ensure critical issues are identified and also alleviate any concerns related to system and network overhead.
Server monitoring’s most intrinsic value is being proactive, so take special care to be organized and build out a matrix of who should be alerted about what and when. The alerting should have built-in escalation and also handle IT personnel changes with minimal or no intervention.
Lastly, do not “black box” your server monitoring and alerting! Make sure multiple IT staff are trained on how to configure, maintain, and extend your server monitoring and alerting technology. Many a deployment has failed because the one individual who understood how server monitoring and alerting was implemented moved on. Given the dynamic nature of information technology, it is important that core competency is always available to handle any required adjustments.