Cost optimization of Google Cloud VMware Engine deployments

Google Cloud VMware Engine allows you to deploy a managed VMware environment with dedicated hardware in minutes, with the flexibility to add and remove ESXi nodes on demand. This flexibility is particularly convenient for quickly adding compute capacity as needed. However, it is important to implement cost optimization strategies and review processes so that increased cost does not come as a surprise. Since hardware needs to be added to ESXi clusters to accommodate increased workload capacity, additional care needs to be taken when scaling out Private Clouds.

In this blog, we explore best practices for operating Google Cloud VMware Engine Private Clouds with a focus on optimizing overall cost.

Google Cloud VMware Engine Billing Principles

Customers are billed hourly based on the number of VMware ESXi nodes that are deployed. Similar to the commitment-based pricing model that Compute Engine offers for Virtual Machine instances, Google Cloud VMware Engine offers committed use discounts for one- and three-year terms. For detailed pricing information, refer to the pricing section on the Google Cloud VMware Engine Product Overview page.

Cost Optimization Strategy #1: Apply Committed Use Discounts

Committed Use Discounts (CUDs) are discounts based on a commitment to running a number of Google Cloud VMware Engine nodes in a particular region for either a one- or three-year term. A three-year commitment can provide up to a 50% discount if its cost is invoiced in full at the start of the contract. As you might expect, commitment-based discounts cannot be canceled once purchased, so you need confidence in how many nodes you will run in your Private Cloud. Apply this discount to the minimum number of nodes you will run over a one- or three-year period, and revisit the number of nodes needed regularly if Google Cloud VMware Engine is the target platform of your data center migration.
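As a simple illustration (the rate below is a placeholder, not an actual price): a Private Cloud that runs 10 nodes around the clock consumes roughly 10 × 730 = 7,300 node-hours per month. At an on-demand rate of R per node-hour this costs about 7,300 × R per month, while a fully prepaid three-year commitment at up to 50% off would bring the same footprint down to roughly half of that. Running this calculation with the actual rates for your region from the pricing page makes it easy to compare the on-demand and committed options.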

Cost Optimization Strategy #2: Optimize your Storage Consumption

If your workloads consume a lot of storage (e.g. backup systems, file servers or large databases), you may need to scale out the cluster to add additional vSAN storage capacity. Consider the following optimization strategies to keep storage consumption low:

1. Apply custom storage policies that use RAID 5 or RAID 6 rather than RAID 1 (the default storage policy) while achieving the same Failures to Tolerate (FTT) value; a worked capacity example follows this list. The FTT value is a key metric, as it is directly linked to the monthly uptime SLA of the ESXi cluster. A storage policy using RAID 1 with FTT=1 incurs a 100% storage overhead, since a RAID 1 encoding scheme mirrors blocks of data for redundancy. Similarly, if FTT=2 is required (e.g. for more critical workloads that require 99.99% uptime availability), RAID 1 produces a 200% storage overhead. A RAID 5 configuration is more storage efficient, as a storage policy with FTT=1 can be achieved with only a 33% storage overhead. Likewise, RAID 6 can provide FTT=2 with only a 50% storage overhead, which means 50% of storage consumption can be saved compared to the default RAID 1 policy. Note, however, that there is a tradeoff: RAID 1 requires fewer I/O operations to the storage devices and may provide better performance.

2. Create new disks using the “Thin Provisioning” format. Thin-provisioned disks save storage space as they start small and expand as more data is written to the disk. A sketch for finding disks that are not thin provisioned follows this list.

3. Avoid backups filling up vSAN storage. Backup tools such as Actifio provide integrations between VMware environments and Cloud Storage, allowing operators to move backups to a cheaper storage location. Backup data that needs to be retained long-term should be stored in Cloud Storage buckets with lifecycle policies that move data to a cheaper storage class after a certain time period.

4. Enable deduplication and compression on the vSAN cluster to reduce the number of redundant data blocks and hence the overall storage consumption.
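To make the overhead figures from point 1 concrete: storing 10 TB of VM data with the default RAID 1 / FTT=1 policy consumes roughly 20 TB of raw vSAN capacity (100% overhead), while RAID 5 / FTT=1 consumes only about 13.3 TB (33% overhead). For FTT=2, RAID 1 requires roughly 30 TB of raw capacity (200% overhead), whereas RAID 6 requires about 15 TB (50% overhead).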
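For point 2, the following is a minimal PowerCLI sketch for finding disks that are not thin provisioned. It assumes the VMware PowerCLI module is installed on a script execution host with network access to the Private Cloud’s vCenter; the server address, credentials and object names are placeholders.

# Requires the VMware PowerCLI module: Install-Module VMware.PowerCLI
# Connect to the Private Cloud vCenter (placeholder address and credentials).
Connect-VIServer -Server "vcsa.example.gve.local" -User "CloudOwner@gve.local" -Password "<password>"

# List all virtual disks that are not thin provisioned, together with their size,
# to identify the largest candidates for conversion to the thin format.
Get-VM |
  Get-HardDisk |
  Where-Object { $_.StorageFormat -ne "Thin" } |
  Select-Object @{ Name = "VM"; Expression = { $_.Parent.Name } }, Name, StorageFormat, CapacityGB |
  Sort-Object CapacityGB -Descending |
  Format-Table -AutoSize

Large disks that show up as thick provisioned can then be converted, for example by migrating them with Storage vMotion and selecting thin provisioning as the target disk format.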

Cost Optimization Strategy #3: Rightsize ESXi Clusters

1. Size ESXi clusters so that CPU, memory and storage utilization are high, while still tolerating the outage of an ESXi node without any failure of workloads. Operating a cluster with resource utilization close to full capacity might cause an outage in case of a sudden hardware failure. Keeping the highest resource utilization metric (CPU, memory or storage) at approximately 70% allows safe operation of the cluster while still making good use of its capacity; see the utilization sketch after this list.

2. Start new Private Cloud deployments with single-node clusters and expand them when needed. Google Cloud VMware Engine has recently added the ability to create private clouds that contain a single node for testing and proofs of concept. Single-node clusters can only be kept for a maximum of 60 days and do not have any SLA. However, they are a great way to minimize cost while integrating day-2 tooling, applying configuration and testing. Once the Private Cloud landing zone is fully configured, the one-node cluster can be expanded to a three-node cluster to become eligible for the SLA.

3. Consolidate ESXi clusters if possible: If you are running workloads on multiple clusters, review whether clusters can be consolidated for more efficient balancing of resources across clusters. As an example, if workloads are divided by OS type or by production and non-production, a higher resource utilization may be achieved if clusters are consolidated. However, take care to review whether other constraints, such as licensing requirements, would prevent consolidation. If VMs need to run on specific nodes for licensing reasons, consider DRS VM-Host affinity rules to pin workloads to specific nodes; a sketch follows this list.

4. Consolidate Private Clouds if possible: Review whether Private Clouds can be consolidated if you run more than one. Each Private Cloud requires its own set of management VMs, which adds overhead to the overall resource consumption.
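For point 1, the following PowerCLI sketch illustrates one way to check whether the highest utilization metric of a cluster stays around the 70% mark. It assumes an existing Connect-VIServer session; the cluster name is a placeholder, and the available sampling intervals depend on the vCenter statistics levels.

# Assumes an existing Connect-VIServer session; the cluster name is a placeholder.
$cluster = Get-Cluster -Name "Cluster-1"

# Average CPU and memory utilization (in percent) over the last 7 days.
Get-Stat -Entity $cluster -Stat cpu.usage.average, mem.usage.average `
         -Start (Get-Date).AddDays(-7) -IntervalMins 30 |
  Group-Object MetricId |
  Select-Object Name, @{ Name = "AvgPercent"; Expression = { [math]::Round(($_.Group | Measure-Object -Property Value -Average).Average, 1) } }

# Storage utilization of the cluster's vSAN datastore.
$cluster | Get-Datastore |
  Where-Object { $_.Type -eq "vsan" } |
  Select-Object Name, CapacityGB, FreeSpaceGB,
                @{ Name = "UsedPercent"; Expression = { [math]::Round((($_.CapacityGB - $_.FreeSpaceGB) / $_.CapacityGB) * 100, 1) } }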
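For the licensing scenario in point 3, a VM-Host affinity rule can be created with PowerCLI along the following lines. This is a sketch with placeholder VM, host and group names, again assuming an existing Connect-VIServer session.

# Assumes an existing Connect-VIServer session; all names below are placeholders.
$cluster = Get-Cluster -Name "Cluster-1"

# Group the VMs with node-locked licenses and the hosts licensed to run them.
$vmGroup   = New-DrsClusterGroup -Name "licensed-vms"   -Cluster $cluster -VM (Get-VM -Name "db-vm-01", "db-vm-02")
$hostGroup = New-DrsClusterGroup -Name "licensed-hosts" -Cluster $cluster -VMHost (Get-VMHost -Name "esxi-node-01.example.local")

# MustRunOn pins the VM group to the host group; ShouldRunOn expresses a soft preference instead.
New-DrsVMHostRule -Name "pin-licensed-vms" -Cluster $cluster `
                  -VMGroup $vmGroup -VMHostGroup $hostGroup -Type MustRunOn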

Cost Optimization Strategy #4: Review Resource Utilization of Workloads

1. Review the resource utilization of VMs on an ongoing basis once applications are running in a steady state. Extract vCenter metrics programmatically or visually to provide right-sizing recommendations. For instance, if VMs turn out to have more CPU and memory allocated than they actually use, tune the VM parameters during a scheduled downtime of the application (this requires a reboot).

Consider scheduling the execution of a script that extracts CPU and memory utilization statistics from vCenter and stores the data in a convenient format, such as a CSV file. As an example, this can be implemented using PowerShell from a script execution host.
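A minimal sketch of such a script using VMware PowerCLI is shown below. The vCenter address, credentials, output path and the 30-day window are placeholders, and the available sampling intervals depend on the vCenter statistics levels.

# Requires the VMware PowerCLI module: Install-Module VMware.PowerCLI
Connect-VIServer -Server "vcsa.example.gve.local" -User "CloudOwner@gve.local" -Password "<password>"

# Collect average CPU and memory usage per VM over the last 30 days.
$report = foreach ($vm in Get-VM) {
    $stats = Get-Stat -Entity $vm -Stat cpu.usage.average, mem.usage.average `
                      -Start (Get-Date).AddDays(-30) -IntervalMins 120

    [PSCustomObject]@{
        VM            = $vm.Name
        NumCpu        = $vm.NumCpu
        MemoryGB      = $vm.MemoryGB
        AvgCpuPercent = [math]::Round(($stats | Where-Object { $_.MetricId -eq "cpu.usage.average" } | Measure-Object -Property Value -Average).Average, 1)
        AvgMemPercent = [math]::Round(($stats | Where-Object { $_.MetricId -eq "mem.usage.average" } | Measure-Object -Property Value -Average).Average, 1)
    }
}

# Store the results in a CSV file for right-sizing reviews.
$report | Export-Csv -Path "C:\reports\vm-utilization.csv" -NoTypeInformation

The resulting CSV can then be compared against the vCPU and memory allocated to each VM to derive right-sizing recommendations.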
