Virtualizing servers with VMware reduces hardware monitoring requirements and saves on facilities costs, but it also introduces a single point of failure for multiple servers: the ESX host computer. A single hardware failure on an ESX host can take down every virtual machine (VM) on that host, rather than just one individual server.
To work around this single point of failure, VMware offers High Availability (HA) and Fault Tolerance (FT) cluster configurations. The vSphere Availability Guide provides in-depth details on the differences between HA and FT clusters, how they work, and how to configure them. Some of the relevant points are:
- HA and FT clustering require that resources be set aside for failover, which means these resources are not available to active VMs. As a result, clustering should only be used for critical servers, with the scale of your implementation depending on your available resources and what your management defines as “critical”.
- ESX host availability is detected through heartbeats from each host. If a host's heartbeat fails, a heartbeat to the storage device is checked, along with an ICMP ping. Verifying a failure this way can delay the start of the failover, but it prevents unnecessary HA reboots caused by false positives.
- VMware Tools can be used to set up heartbeats for individual VMs. If the VM heartbeats fail, the disk and network I/O for the VM are checked, and if there is no I/O for 120 seconds (this time can be adjusted), the VM is restarted. While this may seem like an excessive delay in determining whether a server is hung, it can prevent unnecessary reboots of servers that are experiencing temporary problems.
- HA can either restart a VM that has hung, or restart a VM on a different ESX host if there is a hardware failure on its original host. In both cases, the VM is not available until the reboot has completed.
- FT creates a Secondary VM that is identical to the Primary VM, so a backup is available instantly, without the reboot that HA clustering requires. The Secondary VM uses vLockstep to synchronize every operation with the Primary VM, and both VMs use heartbeats to monitor each other’s status. If the Primary VM fails, the Secondary VM transparently takes over, and another VM is spun up and synced to re-establish fault tolerance. Keep in mind, however, that because FT servers are synchronized, if your VM becomes unresponsive due to a software issue, failing over to the Secondary VM won’t fix the problem.
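The host and VM checks described in the list above can be sketched as two small decision functions. This is a simplified illustration in Python, not VMware's implementation: the names `HostStatus`, `host_state`, and `vm_action`, and the state labels they return, are hypothetical, and the only value taken from the text is the 120-second I/O window.

```python
from dataclasses import dataclass

# Default window from the text: restart only after 120 s without any I/O.
IO_IDLE_LIMIT_SECONDS = 120


@dataclass
class HostStatus:
    """Hypothetical snapshot of the three host checks named in the list."""
    network_heartbeat_ok: bool  # regular heartbeat from the host
    storage_heartbeat_ok: bool  # heartbeat seen on the storage device
    ping_ok: bool               # ICMP ping answered


def host_state(status: HostStatus) -> str:
    """Treat a host as failed only when every check has failed."""
    if status.network_heartbeat_ok:
        return "healthy"
    # Heartbeat lost, but a secondary check still succeeds:
    # hold off on failover to avoid a false positive.
    if status.storage_heartbeat_ok or status.ping_ok:
        return "suspect"
    return "failed"


def vm_action(heartbeat_ok: bool,
              seconds_since_last_io: float,
              io_idle_limit: float = IO_IDLE_LIMIT_SECONDS) -> str:
    """Decide what to do with a VM based on its heartbeat and recent I/O."""
    if heartbeat_ok:
        return "none"
    # Missing heartbeats alone are not enough; recent disk or network
    # I/O suggests the VM is busy rather than hung.
    if seconds_since_last_io < io_idle_limit:
        return "wait"
    return "restart"
```

In this sketch, a VM with no heartbeat but I/O 30 seconds ago yields `vm_action(False, 30) == "wait"`, which mirrors the trade-off in the list: slower detection in exchange for fewer unnecessary reboots.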
VMware’s HA and FT clustering provide a very good method of addressing the IT equivalent of putting all your eggs in one basket, but they cost resources and may provide a false sense of security. Yes, your VMs will be much less vulnerable to host server hardware failures, but hardware isn’t the only thing that can go wrong. Applications can still fail, operating systems can be misconfigured, and network bandwidth can get swamped. Make sure the monitoring for your virtualized environment goes beyond VMware itself: keep tracking application, network, and OS metrics so that you can see problems coming from any direction.