A Tensor Compiler for Processing-In-Memory Architectures

Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko, Christina Giannoula

TL;DR

This work tackles the data movement bottlenecks in Processing-In-Memory (PIM) for memory-intensive ML kernels by introducing DCC, the first data-centric ML compiler for PIM systems. DCC jointly optimizes data rearrangements and compute code through a four-component design: a multi-layer abstraction that reconciles memory hierarchies with a compute hierarchy, a data-centric schedule generator that creates data tiles mapped to compute tiles, a PIM-specific optimizer, and an XGBoost-based coupled predictor that selects the best end-to-end configurations. Evaluations on two state-of-the-art PIM backends, HBM-PIM and AttAcc, show substantial gains: individual kernels achieve up to 13.17x speedup over GPU-only execution, and end-to-end LLM inference on AttAcc improves by up to 7.71x. The results indicate that integrating data movement costs into the compilation process is crucial for unlocking PIM performance and programmability across diverse ML workloads and models.

Abstract

Processing-In-Memory (PIM) devices integrated with high-performance Host processors (e.g., GPUs) can accelerate memory-intensive kernels in Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging high memory bandwidth at PIM cores. However, Host processors and PIM cores require different data layouts: Hosts need consecutive elements distributed across DRAM banks, while PIM cores need them within local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM backends. Current compilation approaches lack systematic optimization for diverse ML kernels across multiple PIM backends and may largely ignore data rearrangements during compute code optimization. We demonstrate that data rearrangements and compute code optimization are interdependent and need to be jointly optimized during the tuning process. To address this, we design DCC, the first data-centric ML compiler for PIM systems that co-optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction that enables various data distribution and processing strategies on different PIM backends. DCC enables effective co-optimization by mapping data partitioning strategies to compute loop partitions, applying PIM-specific code optimizations, and leveraging a fast and accurate performance prediction model to select optimal configurations. Our evaluations on various individual ML kernels demonstrate that DCC achieves up to 7.68x speedup (2.7x average) on HBM-PIM and up to 13.17x speedup (5.75x average) on the AttAcc PIM backend over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by up to 7.71x (4.88x average) over GPU.
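
The layout mismatch described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the bank count, the one-element interleaving granularity, and the function names are hypothetical and do not reflect the actual address mapping of HBM-PIM or AttAcc, nor DCC's implementation. The sketch only shows why a tensor stored in a Host-friendly layout generally has to be rearranged before PIM cores can process it locally.

```python
# Illustrative only: the bank count, interleaving granularity, and function
# names are assumptions, not the address mapping of HBM-PIM or AttAcc.

NUM_BANKS = 4  # hypothetical number of DRAM banks


def host_layout(num_elems, num_banks=NUM_BANKS):
    """Host-friendly layout: consecutive elements are interleaved across banks."""
    banks = [[] for _ in range(num_banks)]
    for i in range(num_elems):
        banks[i % num_banks].append(i)
    return banks


def pim_layout(num_elems, num_banks=NUM_BANKS):
    """PIM-friendly layout: each bank holds one contiguous chunk of elements."""
    chunk = -(-num_elems // num_banks)  # ceiling division
    return [
        list(range(b * chunk, min((b + 1) * chunk, num_elems)))
        for b in range(num_banks)
    ]


def elements_to_move(num_elems, num_banks=NUM_BANKS):
    """Count elements whose bank differs between the two layouts."""
    chunk = -(-num_elems // num_banks)
    return sum(1 for i in range(num_elems) if i % num_banks != i // chunk)


if __name__ == "__main__":
    print("host:", host_layout(8))
    print("pim: ", pim_layout(8))
    print("elements that change bank:", elements_to_move(8))
```

In this toy setting most elements land in a different bank under the two layouts, which is the kind of rearrangement cost the abstract argues should be considered together with compute code optimization.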

Paper Structure

This paper contains 18 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Near-bank PIM architecture and kernel execution workflow showing (1) input rearrangement, (2) computation execution, and (3) output rearrangement steps.
  • Figure 2: Normalized breakdown of compute and data rearrangement time in Reduction and GEMV, comparing a TVM-based compilation scheme (T) and a manually-tuned best-performing end-to-end implementation (B) at various matrix sizes. The numbers on each bar show the speedup of B over T.
  • Figure 3: DCC overview with multiple PIM backend support.
  • Figure 4: Example schedule generation process for GEMV.
  • Figure 5: An example of adding DCC kernels to a GPT3 model.
  • ...and 5 more figures