TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

Norman P. Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, David Patterson

TL;DR

TPU v4 combines an interconnect that is reconfigured by optical circuit switches (OCSes) with SparseCores, dataflow processors for embeddings, to create a scalable, energy-efficient ML supercomputer. OCS-based topology reconfiguration (including an optional twisted 3D torus) and embedding-specific dataflow hardware for large deep learning recommendation models (DLRMs) help TPU v4 deliver 2.1x the performance and 2.7x the performance per watt of TPU v3 while scaling to 4096 chips. The OCSes also improve deployment, utilization, availability, and security, yet they and their optical components account for less than 5% of system cost and less than 3% of system power. The larger system and the OCS flexibility help large language models, and running TPU v4 in Google's energy-optimized warehouse-scale computers uses ~3x less energy and produces ~20x less CO2e than contemporary DSAs in a typical on-premise data center.

Abstract

In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than Infiniband, OCSes and underlying optical components are <5% of system cost and <3% of system power. Each TPU v4 includes SparseCores, dataflow processors that accelerate models that rely on embeddings by 5x-7x yet use only 5% of die area and power. Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. The TPU v4 supercomputer is 4x larger at 4096 chips and thus ~10x faster overall, which along with OCS flexibility helps large language models. For similar sized systems, it is ~4.3x-4.5x faster than the Graphcore IPU Bow and is 1.2x-1.7x faster and uses 1.3x-1.9x less power than the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers of Google Cloud use ~3x less energy and produce ~20x less CO2e than contemporary DSAs in a typical on-premise data center.

Paper Structure (36 sections, 17 figures, 6 tables)

Figures (17)

  • Figure 1: Connectivity of a $4 \times 4 \times 4$ cube (top) to 3 OCSes (bottom). The "+" and "-" connections with the same dimension and index are connected to the same OCS; 48 of these in-out pairs each connect to a distinct OCS.
  • Figure 2: The TPU v4 package (ASIC in center plus 4 HBM stacks) and printed circuit board with 4 liquid-cooled packages. The board's front panel has 4 top-side PCIe connectors and 16 bottom-side OSFP connectors for inter-tray ICI links.
  • Figure 3: Eight of 64 racks for one 4096-chip supercomputer.
  • Figure 4: Impact of an OCS-connected versus a statically connected supercomputer on goodput (i.e., effective throughput) as CPU availability and slice size vary on a log scale. Goodput is counterintuitive at large slices. At $\frac{1}{4}$ of the 4K chips, goodput for both $99.0\%$ and $99.5\%$ availability is $75\%$, as 3 slices occupy $\frac{3}{4}$ of the chips. Spares are needed to allow scheduling jobs despite some failed nodes, so you can't realistically schedule two 2K-node slices from 4K nodes. With one 2K-node slice ($50\%$ of 4K), you have $50\%$ spares, so it will have $50\%$ goodput. With one 3K-node slice ($75\%$ of 4K), you have $25\%$ spares, and therefore $75\%$ goodput.
  • Figure 5: Example of regular (top) and twisted torus (bottom) topologies for a $4 \times 2$ slice of TPU v4 nodes. The TPU v4 network is three-dimensional, but the figure uses two dimensions for ease of illustration. Each TPU is labeled with its coordinates in the slice. The electrical connections (red dashed lines) remain fixed. By utilizing the flexibility of the OCSes, the optical connections (blue solid lines) can be reconfigured from a rectangular torus to a twisted torus without any physical recabling of the machine; the only change is in the routing tables. TPU v4 uses a $k \times k \times 2k$ configuration from Camarero, Martinez, and Beivide [8].
  • ...and 12 more figures
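
The large-slice arithmetic in the Figure 4 caption can be reproduced with a short sketch. It is a simplification and not the paper's availability model: it assumes only that at least one chip must remain spare, so a set of slices can never occupy all of the chips, and it ignores the 99.0% to 99.9% availability levels that the figure varies. The function name `large_slice_goodput` and its parameters are hypothetical, chosen for illustration.

```python
def large_slice_goodput(total_chips: int, slice_chips: int) -> float:
    """Fraction of chips doing useful work when spares must be held back.

    Assumption (not from the paper): at least one chip stays spare, so the
    scheduled slices may never cover every chip in the machine.
    """
    if not 0 < slice_chips <= total_chips:
        raise ValueError("slice_chips must be in the range 1..total_chips")
    # Largest slice count that still leaves at least one spare chip.
    num_slices = (total_chips - 1) // slice_chips
    return num_slices * slice_chips / total_chips


if __name__ == "__main__":
    total = 4096
    for slice_size in (1024, 2048, 3072):
        goodput = large_slice_goodput(total, slice_size)
        print(f"{slice_size:>4}-chip slices: goodput {goodput:.0%}")
```

Under this assumption the sketch gives 75% for 1K-chip slices (three slices fit), 50% for a 2K-chip slice (a second one would leave no spares), and 75% for a 3K-chip slice, matching the three cases the caption walks through.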