Your website will go down. It doesn't matter how much redundancy you have built in – if AWS can crash, your servers can as well. Exactly how much you can fix on your own will depend on how much of your infrastructure is under your control and the exact nature of the problem. The following steps are a general troubleshooting outline:
Write up documentation and configure monitoring:
Document your site configuration and use the documentation as an outline for monitoring. Update both the documentation and the monitoring whenever your site is modified. The exact mix of what to monitor will depend on your configuration. In general, monitoring for a distributed web application can include:
|Web servers||OS, web applications, web page availability|
|Backend databases||OS, database applications|
|DNS servers||Name resolution|
|Mail servers||Exchange or SMTP availability|
|Firewalls||Network traffic, internet availability|
|Reverse proxy servers||Web pages, web applications|
|Load balancers||Web activity on individual web servers|
Most performance metrics can be monitored from within your infrastructure, but web page availability should also be monitored from an external server that connects to your web pages over the internet. Building in a monitored segment that includes the internet will check for outages due to DNS errors or loss of internet connectivity.
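As a rough illustration of the external check described above, here is a minimal Python sketch intended to run on a machine outside your network. The function name `check_page` and the 10 second timeout are assumptions made for this example, not part of any particular monitoring product.

```python
import time
import urllib.error
import urllib.request


def check_page(url, timeout=10.0):
    """Fetch url once and return (is_up, detail, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
        return True, f"HTTP {status}", time.monotonic() - start
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 503.
        return False, f"HTTP {exc.code}", time.monotonic() - start
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout and similar problems.
        reason = getattr(exc, "reason", exc)
        return False, f"unreachable: {reason}", time.monotonic() - start
```

Because the request travels over the internet, a failure here can point to DNS or connectivity problems that internal monitoring would miss. A scheduler such as cron could call `check_page` every few minutes and record the result.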
Verify alerts are not transient problems.
Temporary glitches in resource availability or network connectivity can cause false alerts. Use the performance monitoring data you’ve collected on your servers to resolve recurring false alerts. For example, if you regularly get website outages overnight, check for backups that might use excessive server resources or too much bandwidth.
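One common way to filter out transient glitches is to re-run a failed check a few times before raising an alert. The sketch below assumes a hypothetical `check` callable that returns True when the site is healthy; the attempt count and delay are illustrative defaults only.

```python
import time


def confirmed_down(check, attempts=3, delay_seconds=30.0, sleep=time.sleep):
    """Return True only if check() fails on every one of `attempts` tries."""
    for attempt in range(attempts):
        if check():
            # The site responded, so treat any earlier failure as transient.
            return False
        if attempt < attempts - 1:
            # Wait before retrying; no wait is needed after the last try.
            sleep(delay_seconds)
    return True
```

With the defaults, an alert is confirmed only after three consecutive failures spread over about a minute. Failures that clear on a retry are still worth logging, since recurring ones may point to scheduled jobs such as backups.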
Open your website in a browser and check for the following errors:
1. Server Not Found: Your site’s URL is not resolving to its IP address.
- Troubleshoot this using nslookup to manually check name resolution.
- If you host your own DNS servers, check that they are resolving correctly and that they aren’t being overwhelmed by a DNS DDoS attack.
- If you use a DNS service, check that it is not experiencing an outage. If it is, you’re limited to checking for updates until the problem is resolved.
- Keep in mind that once you or your provider fix DNS, cached records persist until their TTL expires, so it can take up to 24 hours before the change propagates through DNS. Set expectations (and possibly hosts files) accordingly.
2. Unable to Connect or Webpage is not available: The browser is not able to connect to a web server at the resolved IP address.
- Can you connect to other devices on your intranet from the outside? Is RDP or VPN working? If you can get in from the outside and the internet connection is up, then the problem is internal to your site.
- If the problem is your connection to the internet, check your network equipment to see if anything has crashed or rebooted and hasn’t been reset to the correct configuration. If your equipment appears to be functioning properly, get in touch with your internet provider.
- If the internet connection is not the problem, can you access the web page from the intranet? Check the firewall and any reverse proxy servers to see if they’re blocking the page.
- If you can’t see the web pages from the inside, check that the web servers are up and the web services are running. If you’ve got a backup server, even an underpowered one, bring it up to temporarily host the site while you finish troubleshooting the primary server.
3. Error 404: Page not found: The server is responding but the requested resource does not exist.
If you have multiple servers hosting different pages on your website, check that the server hosting the requested page is up. For example, if you have a blog in a subdirectory of your website (e.g. http://www.heroix.com/blog), verify that the blog server is running. When monitoring, check the availability of at least one page on each separate web server.
4. Error 503: Service Unavailable: The server is unable to respond to the request.
The server may be out of resources or overloaded with requests. Check resource usage on the web servers and check network traffic for possible DDoS attacks.
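The four browser errors above correspond to successive stages of a request: name resolution, TCP connection, and the HTTP response. The following Python sketch walks through those stages in order so that a script can report which one failed. The function name `diagnose` and its return strings are made up for this example.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def diagnose(url, timeout=10.0):
    """Report which stage of a request to url fails, if any."""
    parts = urlsplit(url)
    host = parts.hostname
    if not host:
        raise ValueError(f"no host name in URL: {url!r}")
    port = parts.port or (443 if parts.scheme == "https" else 80)

    # 1. Server Not Found: the name does not resolve to an address.
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return f"name resolution failed: {exc}"

    # 2. Unable to Connect: the name resolves but the connection fails.
    try:
        socket.create_connection((host, port), timeout=timeout).close()
    except OSError as exc:
        return f"cannot connect to {host}:{port}: {exc}"

    # 3 and 4. The server answers: report the HTTP status (404, 503, ...).
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        return f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        return f"request failed: {exc}"
```

Running `diagnose` from both inside and outside your network, and comparing the results, can help narrow down whether the fault lies with DNS, the internet connection, or the web servers themselves.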
As mentioned previously, this is a general outline for troubleshooting website outages. You will need to adapt it to your servers and network configuration. If you’ve got any suggestions for additional browser errors or troubleshooting steps, please leave them in the comments.