
Monitoring for Availability

June 01, 2017 | Susan Bilder

You’ve set up server monitoring, and your CPU, memory and disk I/O values are all within baselines. The server responds to a ping, and your web services are all running.  But when you try to pull up your company’s home page, you get a “request timeout”. Or, worse, a DNS redirection attack routes your traffic to a site hosting malware.


Somewhere in the multiple layers of software and hardware between your web server and your clients’ browsers is a problem. In an ideal world, every possible point of failure is monitored, and you find and fix the failure before anyone notices the website outage. In the real world, there are network elements you don’t control, failures are unpredictable, and time and budget constraints make it impractical to monitor every conceivable point between server and client.


While you can’t completely control the environment between your servers and the clients accessing the resources they provide, you can detect when resources are unavailable - and the sooner you detect the problem, the sooner you can fix it. This blog post will provide general guidelines for availability monitoring and service specific monitoring that can help you identify failures.
 

General Guidelines for Availability Monitoring  

 
Monitor from the location where services are accessed.  

If resources are available across the internet, check from outside your network. If resources are available on your internal network, but across domains or subnets, check for access from those locations.


Don’t make monitoring more complex than necessary.  
You don’t need to monitor every page on a website, or every file on a network share. Select a representative sample to monitor (e.g. a daily log file or a web page with content from a backend database).

Pay special attention to recurring problems.  
Applications and servers can hang, and a service restart or server reboot can restore availability. A restart can function as a band-aid while you’re drilling down into your logs and KPIs to resolve the problem. However, do not rely on it as a long-term fix for availability problems, as it can mask underlying resource or configuration issues.


Make sure there is a problem.  
Network glitches can cause temporary problems that you can’t fix and that will clear up without your intervention. If you have a noisy network, check for persistent issues before alerting.
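One way to separate a persistent problem from a passing glitch is to repeat the check a few times before alerting. The following is a minimal sketch; the function name `confirm_failure`, the retry count, and the delay are illustrative assumptions, not recommended values.

```python
import time
from typing import Callable


def confirm_failure(check: Callable[[], bool],
                    attempts: int = 3,
                    delay_seconds: float = 5.0) -> bool:
    """Return True only if `check` fails on every one of `attempts` tries.

    `check` is any zero-argument callable that returns True when the
    resource is available.
    """
    for attempt in range(attempts):
        if check():
            # A single success means the earlier failure was transient.
            return False
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return True
```

An alert would be raised only when `confirm_failure(...)` returns True. On a quiet network, a smaller `attempts` value shortens the time to detection.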


Prioritize alerts.  
Do not send out the same type of alert for every problem. An overnight outage for a network share used during business hours can be handled by an email to a support queue, while a text message to on call support is more appropriate for an outage of a 24/7 company web page.
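The routing rule in that example can be sketched as a small function. The channel names, the business hours (08:00 to 18:00, Monday to Friday), and the choice to page on-call staff for business-hours services during business hours are all assumptions made for illustration.

```python
from datetime import datetime, time as clock_time


def choose_alert_channel(always_on: bool,
                         when: datetime,
                         open_at: clock_time = clock_time(8, 0),
                         close_at: clock_time = clock_time(18, 0)) -> str:
    """Pick "sms" (on-call text message) or "email" (support queue)."""
    if always_on:
        # 24/7 services always page on-call support.
        return "sms"
    # Assumed business hours: Monday to Friday, open_at <= time < close_at.
    in_hours = when.weekday() < 5 and open_at <= when.time() < close_at
    return "sms" if in_hours else "email"
```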

 

 

Service Specific Monitoring 

 

The monitoring criteria below are listed from the most detailed options to the least. The first item in each category depends on the items that follow it, so monitoring the first item implicitly monitors the others. For example, in the Web sites category, if you verify the contents of a web page, then the page must have been retrieved successfully (HTTP status = 200), and the server must have responded to the request on its web server port.

 

Keep in mind that more detailed checks require more configuration to implement and maintain if the resource is altered, but they also provide checks for additional points of failure.
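Because each detailed check depends on the simpler ones, the same ordering can be used to narrow down where a failure sits. The sketch below assumes each check is a zero-argument callable returning True on success; `locate_failure` is a hypothetical helper name.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def locate_failure(layers: List[Check]) -> Optional[str]:
    """Return the name of the lowest failing layer, or None if all is well.

    `layers` is ordered from the most detailed check to the least,
    e.g. page content, then HTTP status, then listener port.
    """
    if not layers:
        return None
    top_name, top_check = layers[0]
    if top_check():
        # The most detailed check passed, so the layers beneath it work too.
        return None
    # Walk up from the least detailed check; the first failure found
    # is the lowest broken layer.
    for name, check in reversed(layers[1:]):
        if not check():
            return name
    return top_name
```

In normal operation only the first check runs; the simpler checks are needed only after it fails.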

 

Web sites
Page Content
Checking that the text returned on a page matches expected text can identify both outages and page redirections. If your web site uses a database backend, check content managed by the database (e.g. text in a WordPress post).

HTTP Status
A web page should return an HTTP status of 200. Report on any other HTTP status value.
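The page content and HTTP status checks can be combined in one request. This is a minimal sketch using only the Python standard library; the function names and the 10 second timeout are assumptions.

```python
import urllib.error
import urllib.request
from typing import Tuple


def evaluate_page(status: int, body: str, expected_text: str) -> Tuple[bool, str]:
    """Apply the two web checks: HTTP status first, then page content."""
    if status != 200:
        return False, f"HTTP status {status}"
    if expected_text not in body:
        return False, "expected text not found"
    return True, "ok"


def check_page(url: str, expected_text: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Fetch `url` and evaluate the response; network errors count as failures."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return evaluate_page(response.status, body, expected_text)
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return False, f"HTTP status {exc.code}"
    except OSError as exc:
        # URLError, timeouts and connection errors are all OSError subclasses.
        return False, f"request failed: {exc}"
```

Note that `urlopen` follows redirects, so a redirected request can still end with status 200. That is why the content comparison matters for spotting redirections.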

Web Server ports


Default web server ports are 80 (for http) and 443 (for https), but they can be customized to any free port. Check the ports used by your servers to verify that a web server is listening.
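A listener check only needs to open a TCP connection. The sketch below applies equally to the FTP, SMTP, database, SSH and Telnet port checks later in this post; `port_is_listening` is a hypothetical helper name.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts and name resolution failures land here.
        return False
```

For example, `port_is_listening("www.example.com", 443)` would test an HTTPS listener on a placeholder host.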

DNS resolution
Name resolution to specific IP Address
Check the result of a name resolution query for a server with a static IP address. If the IP address does not match the expected value, this could indicate a DNS hijack.

Name resolution
For names that do not have static addresses, verify that the DNS server is able to resolve the requested name.
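Both DNS checks can share one function: pass the expected addresses for names with static IPs, and omit them to test resolution only. This sketch uses the system resolver through the standard library; the `resolver` parameter is an assumption added so the lookup can be swapped out.

```python
import socket
from typing import Callable, Iterable, Optional, Set, Tuple


def resolve_ipv4(hostname: str) -> Set[str]:
    """Resolve `hostname` to its set of IPv4 addresses."""
    infos = socket.getaddrinfo(hostname, None,
                               family=socket.AF_INET, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def check_dns(hostname: str,
              expected_ips: Optional[Iterable[str]] = None,
              resolver: Callable[[str], Set[str]] = resolve_ipv4) -> Tuple[bool, str]:
    """Verify that a name resolves and, optionally, only to expected addresses."""
    try:
        found = resolver(hostname)
    except OSError as exc:
        # socket.gaierror (resolution failure) is an OSError subclass.
        return False, f"resolution failed: {exc}"
    if not found:
        return False, "no addresses returned"
    if expected_ips is not None:
        unexpected = found - set(expected_ips)
        if unexpected:
            # An address outside the expected set may indicate a DNS hijack.
            return False, f"unexpected address: {', '.join(sorted(unexpected))}"
    return True, "ok"
```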

FTP
Check available files
Log in to FTP server and verify files are available.

Log in to FTP server
Verify login is successful.

Check FTP listener port
Verify FTP is listening - default port is 21, but this can be modified.
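The three FTP checks can be run as one login-and-list session, since listing files requires a successful login, which in turn requires a listening port. This is a sketch using Python's standard `ftplib`; the function names are assumptions, and it lists only the login directory.

```python
import ftplib
from typing import Iterable, List, Tuple


def missing_files(listing: Iterable[str], expected: Iterable[str]) -> List[str]:
    """Return the expected file names that are absent from the listing."""
    return sorted(set(expected) - set(listing))


def check_ftp(host: str, user: str, password: str,
              expected_files: Iterable[str],
              port: int = 21, timeout: float = 10.0) -> Tuple[bool, str]:
    """Connect, log in, and verify that the expected files are listed."""
    try:
        with ftplib.FTP(timeout=timeout) as ftp:
            ftp.connect(host, port)
            ftp.login(user, password)
            listing = ftp.nlst()
    except ftplib.all_errors as exc:
        # Covers connection failures, rejected logins and protocol errors.
        return False, f"FTP check failed: {exc}"
    missing = missing_files(listing, expected_files)
    if missing:
        return False, f"missing files: {', '.join(missing)}"
    return True, "ok"
```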

SMTP
Log in to SMTP server
Verify login is successful.

Check SMTP listener port

 

Verify SMTP server is listening - default port is 25, but this can be modified.
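An SMTP check can follow the same pattern with Python's standard `smtplib`. In this sketch, supplying credentials triggers a login; it assumes the server offers STARTTLS before authentication, which may not match your configuration.

```python
import smtplib
from typing import Optional, Tuple


def check_smtp(host: str, port: int = 25,
               user: Optional[str] = None, password: str = "",
               timeout: float = 10.0) -> Tuple[bool, str]:
    """Connect to an SMTP server and, if `user` is given, log in."""
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            smtp.ehlo()
            if user is not None:
                # Assumption: the server supports STARTTLS.
                smtp.starttls()
                smtp.ehlo()
                smtp.login(user, password)
    except OSError as exc:
        # smtplib.SMTPException is an OSError subclass, so this also
        # covers rejected logins and unsupported STARTTLS.
        return False, f"SMTP check failed: {exc}"
    return True, "ok"
```

Calling `check_smtp(host)` without credentials verifies only that the listener answers with a valid SMTP greeting.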

Databases
Query database

 

Test remote connection, login, and query against database. Verify value returned by query is correct.
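The query check can be written against Python's DB-API so that the same function works with any driver. In the sketch below, `sqlite3` is only a local stand-in so the example runs without a server; in practice `connect` would wrap your own driver's connection call to the remote database.

```python
import sqlite3
from typing import Any, Callable, Tuple


def check_query(connect: Callable[[], Any], sql: str, expected: Any) -> Tuple[bool, str]:
    """Connect, run `sql`, and compare the first column of the first row."""
    try:
        conn = connect()
    except Exception as exc:
        return False, f"connect/login failed: {exc}"
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        row = cursor.fetchone()
    except Exception as exc:
        return False, f"query failed: {exc}"
    finally:
        conn.close()
    value = row[0] if row else None
    if value != expected:
        return False, f"unexpected value: {value!r}"
    return True, "ok"


def demo_connect() -> sqlite3.Connection:
    """Stand-in connection factory (assumption): an in-memory SQLite database."""
    return sqlite3.connect(":memory:")
```

The separate failure messages distinguish a connection or login problem from a query problem, which maps onto the less detailed checks that follow.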

Connect to database server

 

Verify that it is possible to log in to database server.

Check database server listener port

 

Verify that database is listening for connections. The default port will vary based on the database server, and the server may be using a port other than the default. Common default ports include:

MySQL 3306
Oracle 1521
MS SQL 1433

SSH/Telnet
Connect to server via SSH or Telnet


Log in to the server using SSH or Telnet and verify that a shell is created. If SSH Keys are used for authentication, verify that they work.

Check SSH or Telnet listener ports


Verify that the SSH daemon, and the inetd daemon that serves Telnet, are listening on the correct ports. The default port for SSH is 22, and the default port for Telnet is 23.
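A step between a plain port check and a full login is to confirm that the listener identifies itself as SSH. SSH servers send an identification line beginning with `SSH-` when a client connects. The sketch below reads that line with the standard library only; it does not authenticate, so it is not a substitute for the login check, and it assumes the identification line arrives in the first packet.

```python
import socket
from typing import Optional


def looks_like_ssh_banner(line: bytes) -> bool:
    """Return True if `line` starts like an SSH identification string."""
    return line.startswith(b"SSH-")


def read_ssh_banner(host: str, port: int = 22, timeout: float = 5.0) -> Optional[bytes]:
    """Return the server's SSH identification line, or None on any failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            data = conn.recv(256)
    except OSError:
        return None
    first_line = data.split(b"\n", 1)[0].strip()
    return first_line if looks_like_ssh_banner(first_line) else None
```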

Network shares
Check files or directories in share
If the files or directories in the share have specific requirements (e.g. maximum size or creation date), verify that they both exist and meet those requirements.

Share availability
Check that the share is available using the specified credentials and over any subnets or AD domains.
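When the share is mounted or reachable by path on the monitoring host, the file checks reduce to a `stat` call. The sketch below is one way to do it under that assumption; it uses modification time as a stand-in for creation date, because creation time is not reported consistently across platforms.

```python
import os
import time
from typing import Optional, Tuple


def check_share_file(path: str,
                     max_bytes: Optional[int] = None,
                     max_age_seconds: Optional[float] = None,
                     now: Optional[float] = None) -> Tuple[bool, str]:
    """Verify that `path` exists and meets optional size and age limits."""
    try:
        info = os.stat(path)
    except OSError as exc:
        # Missing file, unavailable share, or insufficient permissions.
        return False, f"not accessible: {exc}"
    if max_bytes is not None and info.st_size > max_bytes:
        return False, f"file is {info.st_size} bytes, limit is {max_bytes}"
    if max_age_seconds is not None:
        current = time.time() if now is None else now
        if current - info.st_mtime > max_age_seconds:
            return False, "file is older than expected"
    return True, "ok"
```

Running the script under the account that normally uses the share also exercises the credentials.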

 

 

Conclusion  

Server level application monitoring provides the ability to monitor application KPIs and detect problems within an application itself, but it does not provide the ability to detect accessibility problems from remote clients.

Testing an application remotely reveals additional, unpredictable points of failure. Availability monitoring not only helps detect resource outages quickly, but can also help pinpoint the point of failure and the steps needed to bring resources back online.

