Google Cloud a Leader in The Forrester Wave™: AI Infrastructure

Forrester Research has named Google Cloud a Leader in The Forrester Wave™: AI Infrastructure, Q4 2021 report, authored by Mike Gualtieri and Tracy Woo. In the report, Forrester evaluated dimensions of AI architecture, training, inference, and management against a set of pre-defined criteria. Forrester's analysis and recognition give customers the confidence they need to make important platform choices that will have lasting business impact.

Google received the highest possible score in 16 Forrester Wave evaluation criteria: architecture design, architecture components, training software, training data, training throughput, training latency, inferencing throughput, inferencing latency, management operations, management external, deployment efficiency, execution roadmap, innovation roadmap, partner ecosystem, commercial model, and number of customers.

We believe Forrester's high scores in the areas of architecture and innovation recognize Google's vision of being a unified data and AI solution provider for the end-to-end data science experience. We are focused on building the most robust yet cohesive experience, enabling our customers to leverage the best of Google every step of the way. Here are four key areas where Google excels, among the many highlighted in this report.

AI Infrastructure: Leverage the building blocks of innovation

When an organization chooses to run its business on Google Cloud, it benefits from innovative infrastructure available globally. Google offers a rich set of building blocks, such as Deep Learning VMs and containers, the latest GPUs and TPUs, and a marketplace of curated ISV offerings, to help architect a custom software stack on VMs and/or Google Kubernetes Engine (GKE).

Google provides a range of GPU and TPU accelerators for various use cases, including high-performance training, low-cost inference, and large-scale accelerated data processing. Google is the only public cloud provider to offer up to 16 NVIDIA A100 GPUs in a single VM, making it possible to train very large AI models on a single node. Users can start with one NVIDIA A100 GPU and scale to 16 GPUs without configuring multiple VMs for single-node ML training. Google also provides TPU Pods for large-scale AI research with PyTorch, TensorFlow, and JAX. The new fourth-generation TPU Pods deliver exaflop-scale peak performance, with leading results in recent MLPerf benchmarks that included a 480-billion-parameter language model.
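
To make that single-node scaling story concrete, here is a minimal sketch of single-host, data-parallel training in JAX, the kind of workload these multi-GPU VMs and TPU hosts are built for. The linear model, toy data, and learning rate are illustrative placeholders, not anything from the report; the pattern is synchronous SGD with an all-reduce of gradients across local accelerators.

```python
import functools

import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    # Least-squares loss for a toy linear model (placeholder example).
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

# One synchronous SGD step, replicated across every local accelerator.
# Per-device gradients are averaged with an all-reduce (pmean).
@functools.partial(jax.pmap, axis_name="batch")
def train_step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    grads = jax.lax.pmean(grads, axis_name="batch")
    return w - 0.01 * grads

n_dev = jax.local_device_count()            # e.g. 16 on an a2-megagpu-16g VM
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (n_dev, 32, 4))  # one 32-example shard per device
y = jnp.sum(x, axis=-1)                     # toy targets
w = jax.device_put_replicated(jnp.zeros(4), jax.local_devices())

for _ in range(10):
    w = train_step(w, x, y)
```

The same step function runs unchanged on 1 or 16 local devices, which is what makes scaling within a single VM attractive.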

Google Kubernetes Engine provides the most advanced Kubernetes services, with unique capabilities like Autopilot, highly automated cluster version upgrades, and cluster backup and restore. GKE is a good choice for a scalable, multi-node bespoke platform for training, inference, and Kubeflow pipelines, given its support for 15,000 nodes per cluster, auto-provisioning, auto-scaling, and various machine types (e.g., CPU, GPU, and TPU, on-demand or Spot). ML workloads also benefit from GKE's support for dynamic scheduling, orchestrated maintenance, high availability, the Job API, customizability, fault tolerance, and ML frameworks. When a company's footprint grows to a fleet of GKE clusters, its data teams can leverage Anthos Config Management to enforce consistent configurations and security policy compliance.
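
For example, a data team might submit a one-GPU training Job to such a cluster with the official Kubernetes Python client. The sketch below assumes kubeconfig already points at a GKE cluster with a GPU node pool (e.g., via gcloud container clusters get-credentials); the image and command are hypothetical placeholders.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (set up by gcloud beforehand).
config.load_kube_config()

container = client.V1Container(
    name="trainer",
    image="gcr.io/my-project/train:latest",  # hypothetical training image
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"},  # schedule onto a GPU node
    ),
)

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-job"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=2,  # retry the pod up to twice on failure
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```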

Comprehensive MLOps: Build models faster and more easily without skimping on governance 

Google's fully managed Vertex AI platform provides services for ML lifecycle management, from data ingestion and preparation all the way to model deployment, monitoring, and management. Vertex AI requires nearly 80% fewer lines of code to train a model versus competitive platforms¹, enabling data scientists and ML engineers across all levels of expertise to implement Machine Learning Operations (MLOps) so they can efficiently build and manage ML projects throughout the entire development lifecycle.
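
That "fewer lines of code" point is easiest to see in the Vertex AI Python SDK. The sketch below goes from a CSV in Cloud Storage to a deployed AutoML model in a handful of calls; the project, bucket, and target column are hypothetical placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Ingest: register a managed tabular dataset from a (hypothetical) CSV.
dataset = aiplatform.TabularDataset.create(
    display_name="churn-data",
    gcs_source="gs://my-bucket/churn.csv",
)

# Train: AutoML handles feature engineering and model search.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="churn-automl",
    optimization_prediction_type="classification",
)
model = job.run(dataset=dataset, target_column="churned")

# Deploy: create an endpoint for low-latency online predictions.
endpoint = model.deploy(machine_type="n1-standard-4")
```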

Vertex AI Workbench provides data scientists with a single environment for the entire data-to-ML workflow, enabling them to build and train models 5x faster than in traditional notebooks. This is enabled by integrations across data services (like Dataproc, BigQuery, Dataplex, and Looker), which significantly reduce context switching. Users can also access NVIDIA GPUs, modify hardware on the fly, and set up idle shutdown to optimize infrastructure costs.
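
As one example of that integration, a Workbench notebook can pull BigQuery results straight into a dataframe with the standard BigQuery client, with no export step in between; the project, dataset, and table names below are hypothetical.

```python
from google.cloud import bigquery

# In Workbench, credentials come from the notebook's service account.
client = bigquery.Client()

df = client.query(
    "SELECT * FROM `my-project.sales.transactions` LIMIT 1000"
).to_dataframe()
df.head()
```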

Organizations can then build and deploy models built on any framework (including TensorFlow, PyTorch, scikit-learn, or XGBoost) with Vertex AI, with built-in tooling to track a model's performance. Vertex AI Training also provides various approaches for developing large models, including Reduction Server, which optimizes the bandwidth and latency of multi-node distributed training on NVIDIA GPUs for synchronous data-parallel algorithms. Vertex AI Prediction is serverless and performs automatic provisioning and deprovisioning of nodes behind the scenes to provide low-latency online predictions. It also provides the capability to split traffic between multiple models behind an endpoint. Models trained in Vertex AI can also be exported for deployment in private or other public clouds.
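
As a sketch of that traffic-splitting capability, the Vertex AI Python SDK can put two model versions behind one endpoint and shift a slice of traffic to the newer one; the model resource names and request payload are hypothetical placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint.create(display_name="churn-endpoint")

# Deploy v1 with all traffic; autoscaling bounds are set per deployment.
model_v1 = aiplatform.Model("MODEL_V1_RESOURCE_NAME")  # hypothetical
model_v1.deploy(
    endpoint=endpoint,
    traffic_percentage=100,
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=4,
)

# Canary v2 at 20%; Vertex AI rebalances so v1 keeps the remaining 80%.
model_v2 = aiplatform.Model("MODEL_V2_RESOURCE_NAME")  # hypothetical
model_v2.deploy(
    endpoint=endpoint,
    traffic_percentage=20,
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=4,
)

print(endpoint.predict(instances=[{"feature_a": 1.0}]))  # routed 80/20
```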
