Ensuring Virtual Machine Availability

October 04, 2017 | Susan Bilder

Whether due to a hardware failure on a host server or a software failure on a virtual machine (VM), VMs can and will crash. There are ways to automate the response to crashes so that VMs are back online with minimal delay and the services they provide remain available. Today’s post outlines methods you can use to keep VMs and their services available while using VMware or Hyper-V.


Automatic Failover


VMware and Hyper-V offer automatic failover and high availability (HA) options for VMs. At a minimum, both platforms require that the hosts be configured in a cluster and that the cluster use shared storage. Specific details for configuring failover clusters are available at:

VMware vSphere Availability

Deploy a Hyper-V Cluster

For automatic failover due to a host hardware failure, each host has a service or agent that provides a heartbeat indicating that the host is alive. If the host heartbeat fails, VMs that have been configured to fail over will start on a new host.

Automatic failover can also be used to guard against OS hangs or crashes in the VM. Installing Integration Services in Hyper-V or VMware Tools in VMware will enable a heartbeat on the VM that is monitored in vSphere or Hyper-V.  The heartbeat fails if the VM hangs or crashes, and the missed heartbeat triggers a VM restart. 
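To make the heartbeat idea concrete, here is a minimal sketch in Python. It is not how vSphere HA or Hyper-V implement their heartbeats; the `HeartbeatMonitor` class, the host names, and the 15-second timeout are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Toy heartbeat tracker (hypothetical; not a VMware or Hyper-V API)."""

    timeout_s: float  # assumed: seconds of silence before a failure is declared
    last_seen: dict[str, float] = field(default_factory=dict)

    def beat(self, name: str, now: float) -> None:
        """Record a heartbeat from a host or VM at time `now` (in seconds)."""
        self.last_seen[name] = now

    def failed(self, now: float) -> list[str]:
        """Return the names whose last heartbeat is older than the timeout."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=15.0)
monitor.beat("host-a", now=0.0)
monitor.beat("host-b", now=0.0)
monitor.beat("host-a", now=10.0)  # host-b goes silent after t=0

# At t=20, host-a was last seen 10 s ago and host-b 20 s ago,
# so only host-b has exceeded the 15 s timeout.
to_restart = monitor.failed(now=20.0)
```

In a real cluster the platform does this for you; the point of the sketch is that a failure is only declared after a period of silence, which is why failover is never instantaneous.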

Automatic failover comes with risks:
  1. Failover is not instantaneous - it will take some time for the heartbeat failure to be detected and for the new VM instance to boot up.

  2. The new instance of the VM will not be an exact replica of the previous instance since the new instance does not have access to the contents of the memory of the previous VM at the time the host failed.
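As a rough back-of-the-envelope sketch of the first risk, downtime can be thought of as detection time plus boot time. The function and the numbers below are illustrative assumptions, not defaults taken from VMware or Hyper-V.

```python
def estimated_downtime_s(
    heartbeat_interval_s: float,
    missed_beats_allowed: int,
    vm_boot_s: float,
) -> float:
    """Rough downtime estimate: time to notice the failure plus time to boot.

    All three inputs are hypothetical tuning values, not platform defaults.
    """
    detection_s = heartbeat_interval_s * missed_beats_allowed
    return detection_s + vm_boot_s


# Assumed example: 1 s heartbeats, failure declared after 15 missed beats,
# and a VM that needs 60 s to boot and start its services.
downtime = estimated_downtime_s(1.0, 15, 60.0)
```

Plugging in your own measured boot times is a quick way to decide whether HA alone meets the availability target for a given service.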

For some servers, like file servers, the loss of continuity and minimal downtime should cause little if any disruption. For other servers, like database servers, downtime and loss of continuity can have cascading effects across multiple applications.

An added benefit of configuring failover clusters is that they are also the basis for resource balancing features - Distributed Resource Scheduler (DRS) in VMware and Virtual Machine Load Balancing in Hyper-V. Resource balancing automatically selects the best placement of VMs on hosts by continuously monitoring resource usage across all hosts in the cluster and seamlessly migrating VMs to different hosts when needed to maintain optimal resource distribution.
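The general idea behind resource balancing can be sketched with a simple greedy rule. This is a toy illustration only, not the algorithm DRS or Hyper-V actually uses; the `rebalance_once` function, the host and VM names, and the load numbers are all hypothetical.

```python
from typing import Optional


def rebalance_once(
    hosts: dict[str, dict[str, float]], threshold: float
) -> Optional[tuple[str, str, str]]:
    """Move at most one VM from the busiest host to the idlest host.

    `hosts` maps host name -> {VM name: load}. Returns (vm, source, target),
    or None when the load gap is within `threshold` or no move would help.
    """
    totals = {host: sum(vms.values()) for host, vms in hosts.items()}
    busiest = max(totals, key=totals.get)
    idlest = min(totals, key=totals.get)
    gap = totals[busiest] - totals[idlest]
    if gap <= threshold:
        return None

    # Only VMs smaller than the gap can narrow it; a larger VM would
    # just move the imbalance to the other host.
    candidates = {vm: load for vm, load in hosts[busiest].items() if load < gap}
    if not candidates:
        return None

    # Moving a VM whose load is close to half the gap evens the pair out best.
    vm = min(candidates, key=lambda name: abs(candidates[name] - gap / 2))
    hosts[idlest][vm] = hosts[busiest].pop(vm)
    return vm, busiest, idlest


hosts = {
    "host-a": {"web": 40, "db": 30, "cache": 20},  # total load 90
    "host-b": {"mail": 10},                        # total load 10
}
first_move = rebalance_once(hosts, threshold=10)   # moves "web" to host-b
second_move = rebalance_once(hosts, threshold=10)  # hosts now balanced at 50/50
```

A production scheduler weighs CPU, memory, affinity rules, and migration cost, but the loop is similar in spirit: measure, compare, and migrate only when the imbalance is worth the move.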

 

Fault Tolerance

Fault Tolerance (FT) is a VMware feature that addresses the delay and continuity problems associated with HA. In an FT configuration, a primary and a secondary VM run on two different hosts and are continually synchronized so that a "hot spare" of the VM is available to take over instantaneously in the event of a host hardware crash.

FT works as follows:

  1. When an FT VM starts, a primary and a secondary instance of the VM are started on separate hosts, and the VMs are synchronized.

  2. The primary VM monitors the secondary VM. If the secondary VM is not available, the primary VM starts a new secondary VM on a different host.

  3. The secondary VM monitors the primary VM. If the primary VM is not available, the secondary VM becomes the new primary VM and starts up a new secondary VM on a different host.

  4. Since there is always a hot spare secondary VM, the VM can be immediately failed over with no interruption of services.
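The steps above can be sketched as a small state model. This is only a simulation of the role changes described in the list, not VMware's FT implementation; the `FaultTolerantVM` class and the host names are hypothetical.

```python
from typing import Optional


class FaultTolerantVM:
    """Toy model of an FT pair: tracks which host holds each role."""

    def __init__(self, hosts: list[str]) -> None:
        if len(hosts) < 2:
            raise ValueError("FT needs at least two hosts")
        self.healthy_hosts = list(hosts)
        # Step 1: primary and secondary start on separate hosts.
        self.primary: str = hosts[0]
        self.secondary: Optional[str] = hosts[1]

    def _spare_host(self) -> Optional[str]:
        """Pick a healthy host that is not already running the primary."""
        for host in self.healthy_hosts:
            if host != self.primary:
                return host
        return None

    def host_failed(self, host: str) -> None:
        """Apply steps 2 and 3 when a host goes down."""
        if host not in self.healthy_hosts:
            return  # unknown or already failed host: nothing to do
        self.healthy_hosts.remove(host)
        if host == self.primary:
            if self.secondary is None:
                raise RuntimeError("primary lost with no secondary available")
            # Step 3: the secondary takes over as the new primary.
            self.primary = self.secondary
            self.secondary = None
        elif host == self.secondary:
            self.secondary = None
        if self.secondary is None:
            # Steps 2 and 3: start a new secondary on a different host,
            # or stay unprotected (None) if no other host is left.
            self.secondary = self._spare_host()


vm = FaultTolerantVM(["host-a", "host-b", "host-c"])
vm.host_failed("host-a")
# The old secondary (host-b) is now primary; host-c holds the new secondary.
roles_after_failure = (vm.primary, vm.secondary)
```

Note that the model needs a third host to restore protection after a failure, which is one reason FT consumes more cluster capacity than HA.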


FT does have its drawbacks:
  1. Since you are always running a backup VM instance, FT requires significantly more resources than high availability.  FT is best used for mission critical VMs that merit additional resource usage.

  2. If the OS on an FT VM crashes or hangs, both the primary and secondary VMs will crash or hang as well since they are synchronized.  FT VMs should be monitored for availability to guard against this possibility.



Software Clusters 
Software clusters do not address host hardware failure, but they can spread an application across multiple VMs. If one or more VMs or hosts crash, an application installed on a cluster loses any nodes associated with the crashed VMs.  However, the application will still be available through the remaining VMs in the cluster.
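Here is a minimal sketch of why a clustered application survives the loss of a node: requests are simply spread over whichever nodes are still up. The `route_requests` function and node names are hypothetical and do not reflect how Windows Failover Clustering or any Linux cluster suite routes traffic.

```python
from itertools import cycle


def route_requests(nodes: list[str], down: set[str], n_requests: int) -> list[str]:
    """Assign requests round-robin to the cluster nodes that are still up."""
    up = [node for node in nodes if node not in down]
    if not up:
        raise RuntimeError("no cluster nodes available")
    rotation = cycle(up)
    return [next(rotation) for _ in range(n_requests)]


# Three-node cluster with vm2 crashed: the application keeps serving
# requests from vm1 and vm3.
assignments = route_requests(["vm1", "vm2", "vm3"], down={"vm2"}, n_requests=4)
```

The application only becomes unavailable when every node is down, which is why it helps to place cluster nodes on different hosts.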


Microsoft Cluster Service (MSCS) has been available since early versions of Windows, with Hyper-V providing integrated support and VMware providing guidelines for implementing specific versions of MSCS and vSphere. The most recent implementation of MSCS is Windows Failover Clustering, with both Hyper-V and VMware providing detailed support.

Linux also supports clusters, and each Linux distribution has its own implementation instructions for Hyper-V or VMware - for example, implementing Red Hat Cluster Suite with Red Hat or CentOS Linux on Hyper-V. For more details, refer to the documentation for your Linux distribution.

 

 

Cloud Recovery Options

VMware and Hyper-V both have cloud-based extensions that offer recovery services if a site goes down. Hyper-V offers Microsoft Azure Site Recovery Service, and VMware offers vCloud Air Disaster Recovery to recover VMware to a cloud-based environment. Additionally, VMware offers Site Recovery Manager, which can recover VMware to a backup site rather than to the cloud.

Cloud and offsite options are fully-fledged disaster recovery suites that can take over site operations if the primary site becomes unavailable, but keep in mind you will be paying for cloud resources while they are in operation.

 

Conclusions

Ensuring the availability of VMs comes with tradeoffs:

  • Automatic failover uses fewer resources but has inherent delays and the potential for lost information when a VM fails over to a new host.

  • Fault tolerance addresses the drawbacks of automatic failover, but uses more resources and won’t fix an OS crash.

  • Software clustering can increase availability by spreading applications across multiple VMs, but, once again, at the cost of more resources.

  • At the upper end of availability measures is offsite recovery, which can replicate your entire virtualized infrastructure, but also needs to replicate resources to do so.

 

Want to learn more?

Download our Overcommitting VMware Resources Whitepaper for the guidelines you need to ensure that you are getting the most out of your host resources without sacrificing performance.
