You’ve set up server monitoring, and your CPU, memory and disk I/O values are all within baselines. The server responds to a ping, and your web services are all running. But when you try to pull up your company’s home page, you get a “request timeout”. Or, worse, a DNS redirection attack routes your traffic to a site hosting malware.
Somewhere in the multiple layers of software and hardware between your web server and your clients’ browsers is a problem. In an ideal world, every possible point of failure is monitored, and you find and fix the failure before anyone notices the outage. In the real world, there are network elements you don’t control and failures you can’t predict, and time and budget constraints make it impractical to monitor every conceivable point between server and client.
While you can’t completely control the environment between your servers and the clients accessing their resources, you can detect when those resources are unavailable, and the sooner you detect a problem, the sooner you can fix it. This blog post provides general guidelines for availability monitoring, along with service-specific checks that can help you identify failures.
General Guidelines for Availability Monitoring
Monitor from the location where services are accessed.
If resources are available across the internet, check from outside your network. If resources are available on your internal network, but across domains or subnets, check for access from those locations.
Don’t make monitoring more complex than necessary.
You don’t need to monitor every page on a website, or every file on a network share. Select a representative sample to monitor (e.g. a daily log file or a web page with content from a backend database).
Pay special attention to recurring problems.
Applications and servers can hang, and a service restart or server reboot can restore availability. An automatic restart can function as a band-aid while you drill down into your logs and KPIs to resolve the problem, but do not rely on it as a long-term fix, because it can mask underlying resource or configuration issues.
Make sure there is a problem.
Network glitches can cause temporary problems that you can’t fix and that will clear up without your intervention. If you have a noisy network, check for persistent issues before alerting.
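One way to check for a persistent issue is to alert only when a check fails several times in a row. The sketch below is a minimal illustration, not a prescribed method: the function name, the default of three attempts, and the ten-second delay are all assumptions you would tune for your own network.

```python
import time
from typing import Callable


def is_persistent_failure(
    check: Callable[[], bool],
    attempts: int = 3,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if `check` fails on every one of `attempts` tries.

    `check` is any zero-argument function that returns True when the
    resource is available. The defaults are illustrative assumptions.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        if check():
            return False  # one success means the problem was transient
        if attempt < attempts - 1:
            sleep(delay_seconds)  # wait before trying again
    return True
```

Any of the availability checks in this post can be passed in as `check`, so the alert fires only when the failure outlasts the retries.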
Do not send out the same type of alert for every problem.
An overnight outage of a network share used only during business hours can be handled by an email to a support queue, while a text message to on-call support is more appropriate for an outage of a 24/7 company web page.
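The routing decision described here can be sketched as a small function. Everything in it is a hypothetical illustration: the channel names, the 08:00 to 18:00 weekday window, and the choice to send a text message when a business-hours resource fails during business hours are assumptions, not rules from this post.

```python
from datetime import datetime, time


def choose_alert_channel(
    is_24x7: bool,
    when: datetime,
    business_start: time = time(8, 0),   # assumed start of business hours
    business_end: time = time(18, 0),    # assumed end of business hours
) -> str:
    """Pick an alert channel based on how urgently the resource is needed."""
    if is_24x7:
        return "text_message"  # always-on resources go to on-call support
    is_weekday = when.weekday() < 5  # Monday is 0, Friday is 4
    in_hours = is_weekday and business_start <= when.time() < business_end
    # Assumption: outside business hours, an email to the queue is enough.
    return "text_message" if in_hours else "email_queue"
```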
Service Specific Monitoring
The monitoring criteria below are listed from the most detailed to the least. In each category, the first check depends on the checks that follow it, so monitoring the first item implicitly monitors the others. For example, in the Web sites category, if you verify the contents of a web page, then the page must have been retrieved successfully (HTTP status = 200), and the server must have responded to the request on its web server port.
Keep in mind that more detailed checks take more configuration to implement, and more maintenance when the resource changes, but they also cover additional points of failure.
Checking that the text returned on a page matches expected text can identify both outages and page redirections. If your web site uses a database backend, check content managed by the database (e.g. text in a WordPress post).
A web page should return an HTTP status of 200. Report on any other HTTP status value.
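The two web page checks above can be combined in a short script. This is a minimal sketch using Python's standard `urllib`; the function names, result strings, and ten-second timeout are assumptions made for illustration. Note that `urlopen` follows redirects, so a redirected page would typically show up here as a content mismatch rather than as a non-200 status.

```python
import urllib.error
import urllib.request


def evaluate_page(status: int, body: str, expected_text: str) -> str:
    """Classify a response: status first, then content."""
    if status != 200:
        return f"unexpected HTTP status {status}"
    if expected_text not in body:
        return "expected text not found"
    return "ok"


def check_page(url: str, expected_text: str, timeout: float = 10.0) -> str:
    """Fetch `url` and verify both its HTTP status and its content."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # Raised for 4xx and 5xx responses; must be caught before URLError.
        return f"unexpected HTTP status {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Covers DNS failures, refused connections, and timeouts.
        return f"request failed: {exc}"
    return evaluate_page(status, body, expected_text)
```

For a database-backed site, `expected_text` would be a phrase that only appears when the backend query succeeds.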
Web server ports
Name resolution to specific IP address
Check the result of a name resolution query for a server with a static IP address. If the IP address does not match the expected value, this could indicate a DNS hijack.
For names that do not have static addresses, verify that the DNS server is able to resolve the requested name.
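Both name resolution checks can be sketched with the standard `socket` module. The function names and result strings are assumptions for illustration, and the sketch uses whatever resolver the monitoring host is configured with, so it only reflects what clients in that same location would see.

```python
import socket
from typing import Callable, Iterable, Optional, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Resolve a hostname to its IPv4 addresses with the system resolver."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return {info[4][0] for info in infos}


def check_resolution(
    hostname: str,
    expected_ips: Optional[Iterable[str]] = None,
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> str:
    """Verify that a name resolves and, optionally, to the expected addresses."""
    try:
        resolved = resolver(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return f"resolution failed: {exc}"
    if not resolved:
        return "resolution failed: no addresses returned"
    if expected_ips is not None:
        unexpected = resolved - set(expected_ips)
        if unexpected:
            # An address outside the expected set could indicate a hijack.
            return f"unexpected addresses: {sorted(unexpected)}"
    return "ok"
```

Pass `expected_ips` for servers with static addresses, and leave it out for names whose addresses change, where the check only confirms that resolution works.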
Check available files
Log in to the FTP server and verify that the expected files are available.
Log in to FTP server
Verify login is successful.
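The FTP login and file checks can be sketched with Python's standard `ftplib`. The function names, parameters, and result strings are hypothetical. The sketch also assumes the server returns bare file names from a name listing after changing directory, which is common but not guaranteed.

```python
import ftplib
from typing import Iterable, List


def missing_files(listing: Iterable[str], expected: Iterable[str]) -> List[str]:
    """Return the expected file names that are absent from the listing."""
    present = set(listing)
    return sorted(name for name in set(expected) if name not in present)


def check_ftp_files(
    host: str,
    user: str,
    password: str,
    directory: str,
    expected: Iterable[str],
    port: int = 21,
    timeout: float = 10.0,
) -> str:
    """Log in to an FTP server and verify that the expected files exist."""
    try:
        with ftplib.FTP(timeout=timeout) as ftp:
            ftp.connect(host, port)    # fails if nothing is listening
            ftp.login(user, password)  # fails if the credentials are rejected
            ftp.cwd(directory)
            listing = ftp.nlst()       # names in the current directory
    except ftplib.all_errors as exc:
        return f"FTP check failed: {exc}"
    missing = missing_files(listing, expected)
    return "ok" if not missing else f"missing files: {missing}"
```

Because the file check cannot succeed without a successful connection and login, it implicitly covers the two checks listed after it.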
Check FTP listener port
Verify that the FTP server is listening. The default port is 21, but this can be changed.
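A listener port check amounts to attempting a TCP connection and seeing whether it succeeds. Here is a minimal sketch using the standard `socket` module; the function name and the five-second timeout are assumptions for illustration.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

The same function applies to any of the listener port checks in this post. Only the port number changes, and it should match whatever port the service is actually configured to use.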
Log in to SMTP server
Verify login is successful.
Check SMTP listener port
Verify that the SMTP server is listening. The default port is 25, but this can be changed.
Test a remote connection, login, and query against the database. Verify that the value returned by the query is correct.
Connect to database server
Verify that it is possible to log in to the database server.
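The connect, log in, and query sequence can be sketched against Python's DB-API interface, which most database drivers implement. The function name and result strings are assumptions. The example uses an in-memory SQLite database purely as a stand-in so the sketch runs anywhere; a real check would pass a callable that opens a remote connection using the driver for your database server.

```python
import sqlite3
from typing import Any, Callable


def check_query(connect: Callable[[], Any], query: str, expected: Any) -> str:
    """Connect, run a query, and compare the first column of the first row."""
    try:
        connection = connect()  # for remote servers, this includes the login
    except Exception as exc:
        return f"connection or login failed: {exc}"
    try:
        cursor = connection.cursor()
        cursor.execute(query)
        row = cursor.fetchone()
    except Exception as exc:
        return f"query failed: {exc}"
    finally:
        connection.close()
    if row is None:
        return "query returned no rows"
    if row[0] != expected:
        return f"unexpected value: {row[0]!r}"
    return "ok"


# Stand-in usage: SQLite in memory instead of a remote database server.
result = check_query(lambda: sqlite3.connect(":memory:"), "SELECT 1 + 1", 2)
print(result)  # prints "ok"
```

Separating the three result types tells you which layer failed: the connection and login, the query itself, or the data it returned.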
Check database server listener port
Verify that the database is listening for connections. The default port will vary based on the database server, and the server may be using a port other than the default. Common default ports include:
Connect to server via SSH or Telnet
Check SSH or Telnet listener ports
Check files or directories in share
If the files or directories in the share have specific requirements (e.g. maximum size or creation date), verify that the files and directories both exist and meet those requirements.
Check that the share is available using the specified credentials, and from any subnets or AD domains that need access to it.
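A file check on a share can be sketched with Python's standard `pathlib`. This assumes the share is already mounted or reachable as a path under the account running the check, so credentials are not handled here. It also uses the file's modification time as a stand-in for creation date, since creation time is not available consistently across platforms. The function name and parameters are hypothetical.

```python
import time
from pathlib import Path
from typing import List, Optional


def check_share_file(
    path: str,
    max_size_bytes: Optional[int] = None,
    max_age_seconds: Optional[float] = None,
    now: Optional[float] = None,
) -> List[str]:
    """Return a list of problems found with a file; empty means it passed."""
    try:
        info = Path(path).stat()
    except OSError as exc:
        # Missing file, unavailable share, or insufficient permissions.
        return [f"not accessible: {exc}"]
    problems = []
    if max_size_bytes is not None and info.st_size > max_size_bytes:
        problems.append(f"size {info.st_size} exceeds {max_size_bytes} bytes")
    if max_age_seconds is not None:
        current = time.time() if now is None else now
        age = current - info.st_mtime  # assumption: modification time
        if age > max_age_seconds:
            problems.append(f"last modified {age:.0f} seconds ago")
    return problems
```

For a daily log file, for example, a `max_age_seconds` a little over one day would flag a log that has stopped being written.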
Server level application monitoring provides the ability to monitor application KPIs and detect problems within an application itself, but it does not provide the ability to detect accessibility problems from remote clients.
Testing an application remotely reveals additional, unpredictable points of failure. Availability monitoring not only helps you detect resource outages quickly, it can also help pinpoint the point of failure and the steps needed to bring resources back online.