Google Rolls Out TPU/GPU Enhancements for Large-Scale Data Center AI Workloads

It's been a busy summer for the data center industry, particularly in the area of AI, and now, just in time for "back-to-school" season, come new TPU/GPU hardware and software goodies from Google that enhance its cloud capabilities.

As previously noted in DCF reporting by David Chernicoff, cloud-based AI is having an immediate impact on the way the world does business. Google offers more than 20 AI products, services, and solution sets via Google Cloud, whose primary constituency includes developers and data scientists.

Chernicoff notes further:

"Google defines their available services by use case, breaking those down by data scientist, developer, infrastructure, business solutions, and consulting.
Infrastructure is where customers get access to just that; preconfigured deep learning containers and deep learning VMs to run on the Google cloud, but also access to high performance GPUs, Google’s in-house developed hardware accelerated TPUs, and TensorFlow enterprise, which allows users to scale resources across CPUs, GPUs, and TPUs.
This infrastructure supports Google's AI tools, from their unified ML development platform, Vertex AI, to their AutoML tool for training custom ML models."

In an August 29th blog post, Google's Amin Vahdat, VP/GM ML, Systems, and Cloud AI, and Mark Lohmeyer, VP/GM Compute and ML Infrastructure, announced significant enhancements to Google Cloud's AI infrastructure technologies in the form of TPUs and GPUs. There were two central announcements:

Now available in preview, the TPU v5e platform integrates with Google Kubernetes Engine (GKE) and Vertex AI, as well as burgeoning frameworks such as PyTorch, JAX and TensorFlow, allowing users to get started with easy-to-use, familiar interfaces.
Concurrently announced, the company's A3 VMs, based on Nvidia H100 GPUs and delivered as a GPU supercomputer, are to be made generally available this month to power large-scale AI models.

Hitting the AI sweet spot

The company bills its Cloud TPU v5e hardware and software platform as its "most cost-efficient, versatile, and scalable Cloud TPU to date."

The company added that the platform is tailored to hit the performance and cost-efficiency "sweet spot" for medium- and large-scale training and inference, delivering up to 2x higher training performance per dollar and up to 2.5x higher inference performance per dollar for LLMs and generative AI models, as compared to Cloud TPU v4.
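
One way to read the performance-per-dollar figures: if a fixed amount of work gets N times more performance per dollar, it should cost roughly 1/N as much. The short Python sketch below works that out for the quoted "up to" multipliers. The function and constant names are illustrative, and actual savings will depend on the workload and on pricing, which the announcement does not detail here.

```python
def relative_cost(perf_per_dollar_gain: float) -> float:
    """Cost of a fixed workload relative to the baseline (TPU v4 = 1.0),
    given a performance-per-dollar multiplier."""
    if perf_per_dollar_gain <= 0:
        raise ValueError("gain must be positive")
    return 1.0 / perf_per_dollar_gain


# "Up to" multipliers quoted in the announcement; treat them as best cases.
TRAINING_GAIN = 2.0
INFERENCE_GAIN = 2.5

training_cost = relative_cost(TRAINING_GAIN)
inference_cost = relative_cost(INFERENCE_GAIN)

print(f"Training:  about {training_cost:.0%} of the TPU v4 cost at best")
print(f"Inference: about {inference_cost:.0%} of the TPU v4 cost at best")
```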

At less than half the cost of TPU v4, the company said TPU v5e makes it possible for more organizations to train and deploy larger, more complex AI models. How so? The TPU v5e pods allow up to 256 chips to be interconnected with an aggregate bandwidth of more than 400 Tb/s and 100 petaOps of INT8 performance.
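
For a rough sense of scale, the pod-level figures can be divided across the 256 chips. The sketch below is back-of-the-envelope arithmetic only: it assumes the totals are spread evenly over a full pod, and because the bandwidth is quoted as "more than 400 Tb/s", the per-chip bandwidth it prints is a floor rather than an exact figure.

```python
# Pod-level figures as quoted in the article.
POD_CHIPS = 256
POD_INT8_PETAOPS = 100.0      # aggregate INT8 performance
POD_BANDWIDTH_TBPS = 400.0    # aggregate bandwidth, quoted as "more than"


def per_chip(pod_total: float, chips: int = POD_CHIPS) -> float:
    """Evenly divide a pod-level total across its chips (an assumption)."""
    return pod_total / chips


# 1 petaOp = 1,000 teraOps
int8_teraops_per_chip = per_chip(POD_INT8_PETAOPS * 1000)
bandwidth_tbps_per_chip = per_chip(POD_BANDWIDTH_TBPS)

print(f"INT8 per chip:      ~{int8_teraops_per_chip:.1f} teraOps")
print(f"Bandwidth per chip: >{bandwidth_tbps_per_chip:.2f} Tb/s")
```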

The TPU v5e platform also provides support for eight different virtual machine (VM) configurations, ranging from one chip to more than 250 chips within a single slice, allowing customers to choose the right configurations for serving a wide range of LLM and generative AI model sizes.

"Our speed benchmarks are demonstrating a 5X increase in the speed of AI models when training and running on Google Cloud TPU v5e," asserted Wonkyum Lee, Head of Machine Learning, Gridspace. "We are also seeing a tremendous improvement in the scale of our inference metrics. We can now process 1000 seconds in one real-time second for in-house speech-to-text and emotion prediction models — a 6x improvement.”

The blog also announced the general availability of Cloud TPUs in GKE, the scalable and ubiquitous Kubernetes service. Customers can now improve AI development productivity by leveraging GKE to manage large-scale AI workload orchestration on Cloud TPU v5e, as well as on Cloud TPU v4.

Additionally, as part of the enhanced platform launch, the Google Vertex AI tool now supports training with various frameworks and libraries using Cloud TPU VMs for organizations that prefer the simplicity of managed services.

The company noted that the Cloud TPU v5e platform also now supports open-source tools such as Hugging Face’s Transformers and Accelerate, as well as PyTorch Lightning and Ray. The upcoming PyTorch/XLA 2.1 release will include Cloud TPU v5e support, along with new features including model and data parallelism for large-scale model training, among others.

Finally, the company's new Multislice technology, now available in preview, allows users to scale AI models beyond the boundaries of a physical TPU pod, up to tens of thousands of Cloud TPU v5e or TPU v4 chips.

Announced at last month's Google Cloud Next, Multislice is billed as a full-stack large-scale training technology that enables near-linear scaling up to tens of thousands of Cloud Tensor Processing Unit (TPU) chips.

As further noted by the Google blog:

"Historically, a training run could only use a single slice, a reservable collection of chips connected via inter-chip interconnect (ICI). This meant that a run could use no more than 3072 TPU v4 chips, which is currently the largest slice in the largest Cloud TPU system. With Multislice, a training run can scale beyond a single slice and use multiple slices over a number of pods by communicating over data center networking (DCN).
Until now, training jobs using TPUs were limited to a single slice of TPU chips, capping the size of the largest jobs at a maximum slice size of 3,072 chips for TPU v4. With Multislice, developers can scale workloads up to tens of thousands of chips over inter-chip interconnect (ICI) within a single pod, or across multiple pods over a data center network (DCN). Multislice technology powered the creation of our state-of-the-art PaLM models; now we are pleased to make this innovation available to Google Cloud customers."
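
To make the single-slice ceiling concrete, the sketch below counts how many slices a job would need once it outgrows the largest one. It is a simplified illustration, not Google's scheduler: the 20,000-chip job size is hypothetical, the 256-chip limit for TPU v5e is assumed from the pod size cited earlier, and real slices come in fixed shapes rather than arbitrary remainders.

```python
# Largest single slice per TPU generation.
# v4 comes from the Google quote; v5e is an assumption based on the
# 256-chip pod size mentioned earlier in the article.
MAX_SLICE_CHIPS = {"v4": 3072, "v5e": 256}


def plan_slices(total_chips: int, max_slice_chips: int) -> list[int]:
    """Split a job into full slices plus one partial slice if needed."""
    if total_chips <= 0 or max_slice_chips <= 0:
        raise ValueError("chip counts must be positive")
    full, remainder = divmod(total_chips, max_slice_chips)
    return [max_slice_chips] * full + ([remainder] if remainder else [])


JOB_CHIPS = 20_000  # hypothetical job in the "tens of thousands" range

v4_plan = plan_slices(JOB_CHIPS, MAX_SLICE_CHIPS["v4"])
v5e_plan = plan_slices(JOB_CHIPS, MAX_SLICE_CHIPS["v5e"])

print(f"TPU v4:  {len(v4_plan)} slices, last one has {v4_plan[-1]} chips")
print(f"TPU v5e: {len(v5e_plan)} slices, last one has {v5e_plan[-1]} chips")
```

Under these assumptions, anything above one slice is exactly the case where traffic has to cross the data center network rather than stay on ICI.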

"Easily one of the most exciting features for our development team is the unified toolset. The amount of wasted time and hassle they can avoid by not having to make different tools fit together will streamline the process of taking AI from idea to training to deployment," observed Yoav HaCohen, Core Generative AI Team Lead, Lightricks, who added, "Configuring and deploying your AI models on Google Kubernetes Engine and Google Compute Engine along with Google Cloud TPU infrastructure enables our team to speed up the training and inference of the latest foundation models at scale, while enjoying support for autoscaling, workload orchestration, and automatic upgrades."

A3 VMs power GPU supercomputers to drive gen AI workloads in partnership with Nvidia

To empower customers to take advantage of the rapid advancements in AI, Google Cloud is partnering closely with Nvidia on new AI cloud infrastructure and open-source tools for Nvidia GPUs, as well as building workload-optimized end-to-end solutions specifically for generative AI.

Google Cloud and Nvidia aim to make AI more accessible for a broad set of workloads. The blog notes that this vision is now being realized with Google's concurrent announcement that A3 VMs, which power GPU supercomputers for generative AI workloads, will be generally available this month.

Powered by Nvidia’s H100 Tensor Core GPUs, which address trillion-parameter models, the Google Cloud A3 VMs are purpose-built to train and serve especially demanding generative AI workloads and LLMs.

The companies said the combination of Nvidia GPUs with Google Cloud’s infrastructure technologies stands to improve scale and performance and represents a significant extension in supercomputing capabilities, with 3x faster training and 10x greater networking bandwidth compared to the prior-generation platform.

The A3 enhancement enables users to scale models to tens of thousands of Nvidia H100 GPUs.
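
As a rough sizing exercise, Google's specification for the A3 (eight H100 GPUs and 2TB of host memory per VM) translates a GPU count into a VM count and aggregate host memory. The sketch below is simple arithmetic with illustrative names, and the 10,000-GPU target is a hypothetical example rather than a figure from the announcement.

```python
import math

# Per-VM figures from Google's A3 specification.
GPUS_PER_VM = 8
HOST_MEMORY_TB_PER_VM = 2


def a3_fleet(total_gpus: int) -> tuple[int, int]:
    """Return (VM count, total host memory in TB) for a target GPU count."""
    if total_gpus <= 0:
        raise ValueError("total_gpus must be positive")
    vms = math.ceil(total_gpus / GPUS_PER_VM)
    return vms, vms * HOST_MEMORY_TB_PER_VM


TARGET_GPUS = 10_000  # hypothetical cluster size

vm_count, host_memory_tb = a3_fleet(TARGET_GPUS)
print(f"{TARGET_GPUS} GPUs -> {vm_count} A3 VMs, {host_memory_tb} TB host memory")
```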

As further specified by Google:

"The A3 VM features dual next-generation 4th Gen Intel Xeon scalable processors, eight Nvidia H100 GPUs per VM, and 2TB of host memory. Built on the latest Nvidia HGX H100 platform, the A3 VM delivers 3.6 TB/s bisectional bandwidth between the eight GPUs via fourth generation Nvidia NVLink technology. Improvements in the A3 networking bandwidth are powered by our Titanium network adapter and Nvidia Collective Communications Library (NCCL) optimizations. In all, A3 represents a huge boost for AI innovators and enterprises looking to build the most advanced AI models."

The following video provides an inside view of the data centers where the Google Cloud TPUs operate. As noted by the company, "Customers use Cloud TPUs to run some of the world's largest AI workloads and that power comes from much more than just a chip." The video provides a rare peek at the data center physical infrastructure of the TPU system, including core components for data center networking, optical circuit switches, advanced biometric security verification capabilities, and new cooling systems technology.