Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs

Zhantong Zhu, Hongou Li, Wenjie Ren, Meng Wu, Le Ye, Ru Huang, Tianyu Jia

TL;DR

This paper tackles the high power and area cost of hardware accelerators for generative model inference by introducing a digital CIM-based TPU design that replaces the conventional systolic arrays in matrix multiply units (MXUs) with CIM-MXUs. It presents an architecture model and simulator built on a TPUv4i-like baseline, along with a detailed CIM-MXU design and workload mapping strategy. Evaluations on representative models (LLMs and diffusion transformers) show up to 44.2% and 33.8% performance improvement, respectively, and up to a 27.3x reduction in MXU energy consumption, with larger benefits in GEMV-dominant phases such as decoding. The work also explores architectural trade-offs and multi-device scaling, reporting throughput gains and energy reductions relevant to large-scale deployment of generative models on CIM-enhanced TPUs.
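To make the GEMM versus GEMV distinction concrete, the following is a minimal sketch, not taken from the paper, that counts the arithmetic performed per byte of weight data for a single projection layer. The layer width `d_model`, the prompt length `seq_len`, the helper name `matmul_stats`, and the one-byte weight format are all illustrative assumptions. During prefill the whole prompt is multiplied by the weight matrix (a GEMM), while each decoding step with batch size 1 multiplies a single token vector by the same matrix (a GEMV), so each weight is reused far less often.

```python
def matmul_stats(m: int, k: int, n: int, bytes_per_weight: int = 1) -> dict:
    """FLOPs and weight reuse for an (m x k) @ (k x n) product.

    bytes_per_weight=1 is an assumption (8-bit weights), not a value
    stated in the source text.
    """
    flops = 2 * m * k * n                     # one multiply and one add per MAC
    weight_bytes = k * n * bytes_per_weight   # size of the (k x n) weight matrix
    return {
        "flops": flops,
        "weight_bytes": weight_bytes,
        "flops_per_weight_byte": flops / weight_bytes,
    }


# Hypothetical sizes chosen only for illustration.
d_model = 4096
seq_len = 1024

prefill = matmul_stats(seq_len, d_model, d_model)  # GEMM: all prompt tokens at once
decode = matmul_stats(1, d_model, d_model)         # GEMV: one token per step

print("prefill FLOPs per weight byte:", prefill["flops_per_weight_byte"])
print("decode  FLOPs per weight byte:", decode["flops_per_weight_byte"])
```

Under these assumptions the ratio depends only on the number of rows `m`, so the decode step performs `seq_len` times less work per weight byte than prefill. This is one common way to see why decoding stresses weight movement rather than raw compute; the paper's own simulator models this in far more detail.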

Abstract

With the rapid advent of generative models, efficiently deploying these models on specialized hardware has become critical. Tensor Processing Units (TPUs) are designed to accelerate AI workloads, but their high power consumption necessitates innovations for improving efficiency. Compute-in-memory (CIM) has emerged as a promising paradigm with superior area and energy efficiency. In this work, we present a TPU architecture that integrates digital CIM to replace conventional digital systolic arrays in matrix multiply units (MXUs). We first establish a CIM-based TPU architecture model and simulator to evaluate the benefits of CIM for diverse generative model inference. Building upon the observed design insights, we further explore various CIM-based TPU architectural design choices. Up to 44.2% and 33.8% performance improvement for large language model and diffusion transformer inference, and 27.3x reduction in MXU energy consumption can be achieved with different design choices, compared to the baseline TPUv4i architecture.

Paper Structure

This paper contains 16 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Evolution of the computing performance of CIM-based designs.
  • Figure 2: Generative model architecture and runtime breakdown.
  • Figure 3: Architecture modeling of CIM-based TPU.
  • Figure 4: Architecture and CIM design details of CIM-MXU.
  • Figure 5: Workload evaluations with a mapping engine for CIM-based TPUs.
  • ...and 3 more figures