Cloud Capacity Planning and performance reporting are absolute musts if an organization is to make efficient use of its cloud service providers and keep the corresponding costs under control.
While cloud computing does provide tremendous value, it is not without its challenges. Because of the ease with which IT can make large amounts of cloud resources available to its constituents, careful consideration has to be given to how best to size and time the deployment of cloud resources so that costs do not get out of hand.
Virtual Server Sprawl
Organizations typically spend an enormous amount of time determining which cloud service providers to deploy, usually deciding between Amazon Web Services, Microsoft Azure, and Google Compute. Typically, the due diligence includes looking at the configuration/pricing options and the workload, then deciding either to go all-in on one provider or to hedge their bets with a multi-provider cloud strategy.
Accurately assessing workload is critical to deploying cloud resources cost-effectively. No matter the choice made, it is never an easy or straightforward process because so many variables enter into choosing a cloud service provider.
As is usually the case, once management makes the final decision, the cloud implementation is delegated to IT, which is told to “make it so”. Because IT organizations are measured on providing reliable, high-performing services, they will almost always err on the side of configuring cloud resources to exceed agreed-upon SLAs.
Cloud services, along with provisioning technologies like Puppet, Chef, SaltStack, and Ansible, make it incredibly easy to automate the delivery of servers and applications, so costs can escalate quickly when cloud resources are over-allocated.
|Virtual server sprawl is ultimately due to:
Unlike on-premises resources, where server costs are relatively fixed, the cost of cloud resources tends to be variable, which means it can get really expensive really fast!
On-premises computing does provide a natural backstop for over-provisioning because it is tangible: there is a finite amount of server and storage infrastructure residing in the datacenter(s). When host capacity is reached, IT has to either adjust how its virtual resources are being used or jump through the appropriate financial hoops to obtain Capex for additional hardware. In addition, because hardware costs are sunk, virtual sprawl doesn’t necessarily cost the organization anything until more hardware is needed.
Because cloud resources are off-premises, virtual server sprawl is less obvious. Virtual sprawl doesn't happen overnight; rather, it is the result of a slow and steady increase in servers.
|Longitude Report showing VM Sprawl for EC2 Instances|
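Because sprawl builds gradually, it helps to look at the trend rather than at any single month. The following is a minimal sketch, assuming you already have a monthly count of running instances exported from your monitoring tool. The function names and the sample counts are hypothetical and are not taken from the Longitude report.

```python
def month_over_month_growth(counts):
    """Return the percentage change between each pair of consecutive months."""
    changes = []
    for previous, current in zip(counts, counts[1:]):
        if previous == 0:
            raise ValueError("cannot compute growth from a month with zero instances")
        changes.append((current - previous) * 100.0 / previous)
    return changes


def cumulative_growth(counts):
    """Return the percentage change between the first and last month."""
    if len(counts) < 2 or counts[0] == 0:
        raise ValueError("need at least two months and a non-zero starting count")
    return (counts[-1] - counts[0]) * 100.0 / counts[0]


# Hypothetical instance counts for six consecutive months.
instance_counts = [40, 42, 44, 46, 48, 50]

monthly = month_over_month_growth(instance_counts)
total = cumulative_growth(instance_counts)

# No single month grows by more than 5%, yet the fleet ends up 25% larger.
print([round(change, 1) for change in monthly])
print(round(total, 1))
```

Small monthly increases rarely trigger concern on their own, which is why the cumulative figure is the one worth reviewing on a regular schedule.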
Keep in mind that cloud service costs (typically Opex) comprise more than just compute costs: you still need to pay for storage, snapshots, data transfer, data archiving, and more, so it certainly adds up!
Cloud providers do provide budget alerts to help backstop virtual server sprawl. However, problems usually occur when IT waits for the budget alerts before addressing inefficiencies. Depending on the timing, an IT organization may be forced to scramble to reallocate cloud resources or to pay substantial fees for access to additional resources.
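Rather than waiting for a budget alert, a simple run-rate projection can flag a likely overrun earlier in the month. This is a rough sketch, assuming spend accrues roughly evenly from day to day. The function name and the figures are hypothetical, and real bills are rarely this linear.

```python
def projected_month_end_spend(spend_to_date, day_of_month, days_in_month):
    """Extrapolate month-end spend from the average daily spend so far."""
    if day_of_month < 1 or day_of_month > days_in_month:
        raise ValueError("day_of_month must fall within the month")
    daily_rate = spend_to_date / day_of_month
    return daily_rate * days_in_month


# Hypothetical figures: $6,000 spent by day 10 of a 30-day month, $15,000 budget.
budget = 15000.0
projection = projected_month_end_spend(6000.0, 10, 30)
overrun = max(0.0, projection - budget)

print(f"projected spend: {projection:.2f}, projected overrun: {overrun:.2f}")
```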
|Misallocation of cloud resources turns a Capex problem into an Opex problem!|
Cloud Capacity Planning – On Demand Instances and Spare Capacity
Microsoft and especially Amazon provide a dizzying array of purchasing options. It is important to understand that, no matter the purchasing option, the underlying cloud resources are always the same.
The decision for cloud resources is analogous to deciding on an automobile: sizing based on the needed capacity, whether to purchase or lease, the length of the lease, and how much to pay up front.
Both Azure and AWS provide pay-as-you-go “On Demand Instances” (DIs) for server capacity.
DIs benefit from no up-front costs and no long-term commitment - you pay only for the resources you use. In addition, compute capacity can be readily adjusted to accommodate any changes in workload.
DIs tend to be the most expensive option. As a general rule, DIs are most advantageous where the workload peaks and then disappears, for example an application that runs its processing once a month.
You can save money with EC2 Spot Instances or Azure Low-priority VMs, which allow your organization to take advantage of spare cloud capacity at a significant discount. Keep in mind that whether you’re using DIs, EC2 Spot Instances, or Azure Low-priority VMs, the instances can still be preempted by higher-priority “Reserved Instances”. To be fair, EC2 Spot Instances and Azure Low-priority VMs are more likely to be preempted than their DI counterparts.
Cloud Capacity Planning is not simply a matter of looking at resource consumption (e.g. CPU, memory, disk); it is also about looking at the timing of the workloads.
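To illustrate why timing matters, the sketch below compares average CPU with the fraction of hours a server is actually busy. It assumes hourly CPU samples are available. The 10% busy threshold, the function name, and the sample data are all hypothetical.

```python
def busy_fraction(hourly_cpu_percent, busy_threshold=10.0):
    """Return the fraction of sampled hours with CPU above busy_threshold."""
    if not hourly_cpu_percent:
        raise ValueError("need at least one sample")
    busy_hours = sum(1 for cpu in hourly_cpu_percent if cpu > busy_threshold)
    return busy_hours / len(hourly_cpu_percent)


# Hypothetical 30-day month (720 hourly samples): a monthly batch job keeps the
# server at 85% CPU for 72 hours, and it idles at 2% the rest of the time.
samples = [85.0] * 72 + [2.0] * 648

average_cpu = sum(samples) / len(samples)
fraction_busy = busy_fraction(samples)

print(f"average CPU: {average_cpu:.1f}%")
print(f"busy hours: {fraction_busy:.0%} of the month")
```

An average of roughly 10% CPU says little on its own. The fact that the server is only needed for a tenth of the month is what points toward paying for capacity only while the job runs.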
Cloud Capacity Planning – Reserved Instances
As the name implies, with Reserved Instances (RIs) you are “reserving” a defined amount of compute capacity, usually for 1 or 3 years. You are also committing to pay for the capacity whether you use it or not (well, sort of; more on this a little later). RIs benefit from a significant reduction in cost (~70%) over DIs.
As is usually the case with cloud service providers, an apples-to-apples comparison is not all that easy. Although Amazon offers a more flexible set of Reserved Instance purchasing options than Microsoft, Microsoft's Azure Hybrid Benefit can be compelling, as it allows users to bring their Windows Server licenses covered by Software Assurance (SA) to Azure at a discounted rate. Just make sure to read Microsoft's fine print, as there are differences in how Standard and Datacenter licenses are handled.
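The trade-off between RIs and DIs can be expressed as a break-even utilization. The sketch below is a simplification, assuming the RI is billed for every hour of the year at a flat discount off the on-demand rate, while on-demand is billed only for the hours used. The ~70% discount comes from the text above; the hourly rate and function names are hypothetical, and actual discounts vary by term, payment option, and instance type.

```python
HOURS_PER_YEAR = 8760


def break_even_utilization(ri_discount):
    """Fraction of hours an instance must run before the RI is the cheaper option."""
    return 1.0 - ri_discount


def annual_costs(on_demand_hourly, ri_discount, utilization):
    """Return (on_demand_cost, reserved_cost) for one year at a given utilization."""
    on_demand_cost = on_demand_hourly * HOURS_PER_YEAR * utilization
    reserved_cost = on_demand_hourly * (1.0 - ri_discount) * HOURS_PER_YEAR
    return on_demand_cost, reserved_cost


def cheaper_option(on_demand_hourly, ri_discount, utilization):
    """Name the cheaper purchasing option under the assumptions above."""
    on_demand_cost, reserved_cost = annual_costs(on_demand_hourly, ri_discount, utilization)
    return "reserved" if reserved_cost < on_demand_cost else "on-demand"


# Hypothetical $0.10/hour on-demand rate with the ~70% RI discount.
print(f"break-even utilization: {break_even_utilization(0.70):.0%}")
print(cheaper_option(0.10, 0.70, 0.10))  # server busy 10% of the year
print(cheaper_option(0.10, 0.70, 0.80))  # server busy 80% of the year
```

Under these assumptions, a server that runs less than about 30% of the time is cheaper on demand, even though the RI's hourly rate looks far more attractive.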
|Cloud Capacity Planning is critical to any Reserved Instance deployment because your organization is committing to pay for a defined amount of capacity for 1 or 3 years whether it uses it or not!|
Capacity planning for a workload that spans a 1-year period of time is an easier endeavor than planning for a 3-year period of time:
- As a simple matter of probability, projecting out 3 years is inherently a riskier proposition.
A 3-year agreement can get expensive, especially if the projected workload does not come to fruition. In the end it is all about risk and reward. Is it worth the additional savings to go out 3 years? What is the likelihood of having unused or misallocated RIs? How much over-commitment is your organization willing to tolerate?
- Pricing is trending downward, with both Amazon and Microsoft constantly moving the bar with lower prices and revised offerings (e.g. new licensing schemes for RIs).
You’ll need to determine whether it is worth committing to 3 years (especially if there is limited ability to take advantage of configuration changes or price reductions) or whether 1 year at a time is more appropriate. Ultimately, your decision may well depend on the type of deal you can wrangle from your cloud service provider.
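The risk-versus-reward question can be framed as an expected-cost comparison. The sketch below is a deliberately simple model. It assumes the workload either lasts the full 3 years (with probability `p_workload_persists`) or disappears after year 1, that unused 3-year capacity cannot be resold, and that 1-year renewals get cheaper by a fixed percentage each year. All prices, probabilities, and names are hypothetical.

```python
def three_year_cost(annual_price_3yr):
    """Total paid under a 3-year commitment, whether or not the capacity is used."""
    return 3 * annual_price_3yr


def expected_one_year_cost(annual_price_1yr, p_workload_persists, annual_price_drop=0.0):
    """Expected total of buying 1 year now and renewing only if the workload persists."""
    year_two = annual_price_1yr * (1.0 - annual_price_drop)
    year_three = year_two * (1.0 - annual_price_drop)
    return annual_price_1yr + p_workload_persists * (year_two + year_three)


def break_even_probability(annual_price_1yr, annual_price_3yr):
    """Probability of the workload persisting at which both strategies cost the
    same, assuming no price drop on renewals."""
    return (three_year_cost(annual_price_3yr) / annual_price_1yr - 1.0) / 2.0


# Hypothetical prices: $600/year on a 1-year term, $400/year on a 3-year term.
print(three_year_cost(400.0))
print(expected_one_year_cost(600.0, 0.5))
print(expected_one_year_cost(600.0, 0.5, annual_price_drop=0.10))
print(break_even_probability(600.0, 400.0))
```

With these made-up numbers, the 3-year term only pays off if you are more than 50% confident the workload will last, and falling renewal prices push that threshold higher.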
What happens if you have more RIs than you need? You made your bet and lost. What do you do?
Amazon supports a secondary market where third parties can purchase unwanted RIs. Microsoft makes things quite a bit easier: simply cancel the RIs at any time and Microsoft will buy back the unwanted RIs.
Selling unused RIs can be a costly proposition as both Amazon and Microsoft include a termination fee of 12% of the price.
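To see how the fee eats into what you recover, here is a rough sketch. It assumes the unused portion of the RI is valued pro rata by remaining months and that the 12% fee is charged on that remaining value. The text above does not specify how each provider applies the fee, so treat both as assumptions. The function name and figures are hypothetical.

```python
def early_exit_proceeds(total_price, term_months, months_used, fee_rate=0.12):
    """Return (remaining_value, fee, proceeds) when giving up an RI before its term ends."""
    if not 0 <= months_used <= term_months:
        raise ValueError("months_used must be between 0 and term_months")
    remaining_value = total_price * (term_months - months_used) / term_months
    fee = remaining_value * fee_rate
    return remaining_value, fee, remaining_value - fee


# Hypothetical 3-year RI costing $3,600 in total, given up after 12 months.
remaining_value, fee, proceeds = early_exit_proceeds(3600.0, 36, 12)
effective_cost_of_months_used = 3600.0 - proceeds

print(f"remaining value: {remaining_value:.2f}")
print(f"fee: {fee:.2f}")
print(f"proceeds: {proceeds:.2f}")
print(f"effective cost of the 12 months used: {effective_cost_of_months_used:.2f}")
```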
Cloud Capacity Planning – Hybrid Cloud
Many organizations operate an on-premises IT infrastructure in conjunction with one or more cloud service providers, a “Hybrid Cloud”. While cloud computing delivers the benefits of increased efficiencies and reduced costs, there may well be workloads that are better suited to on-premises infrastructure. Either way, IT organizations have to keep operating costs down and productivity up.
|Capacity planning becomes an increasingly difficult proposition when organizations don’t have an integrated approach to track all the components across the hybrid infrastructure.|
On-premises workloads are often necessary because of compliance issues related to security, privacy, and control. Things become a bit more challenging when applications leverage both on-premises and cloud resources, as a bottleneck could have a ripple effect across the entire IT infrastructure.
Ultimately, proper Cloud Capacity Planning is about accurately gauging the size and type of workload and matching it to an appropriate set of Reserved Instances. Here are a few questions that can help guide you:
Plan too low and you may not be able to take advantage of the price breaks that come with economies of scale. Plan too high and you’re either paying for unused resources or having to unload them at a significant penalty.