
Balanced Data Placement for GEMV Acceleration with Processing-In-Memory

Mohamed Assem Ibrahim, Mahzabeen Islam, Shaizeen Aga

TL;DR

This work tackles the memory bandwidth bottleneck of GEMV during GenAI inference on edge devices by leveraging commercially viable PIM designs. It introduces PIMnast, a data-placement and tiling methodology that balances architecture, memory configuration, and GenAI/GEMV needs to maximize GEMV-PIM throughput, augmented by orchestration techniques like register allocation and input-vector reuse. Through analytical models and targeted evaluations, PIMnast achieves up to $6.86\times$ GEMV speedups (near the $7\times$ roofline) and up to $5\times$ reductions in per-token latency, with robust end-to-end gains on a spectrum of GenAI models. The approach demonstrates practical viability for deploying larger, lower-precision GenAI models on client devices by exploiting DRAM row-locality, inter-bank broadcasts, and large page interleaving, offering a path to substantial on-device GenAI acceleration.
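The bandwidth bottleneck and the roofline figures quoted above can be made concrete with a little arithmetic. The sketch below is a back-of-the-envelope illustration, not the paper's analytical model: the function names, the 4096 x 4096 matrix shape, and the 2-byte element size are assumptions chosen for illustration. It shows that GEMV performs roughly one floating-point operation per byte moved at 16-bit precision, and that a 6.86x speedup against a 7x roofline is 98% of the available gain.

```python
def gemv_arithmetic_intensity(rows, cols, bytes_per_element):
    """FLOPs per byte for y = A @ x, assuming each operand is moved once."""
    # One multiply and one add per matrix element.
    flops = 2 * rows * cols
    # The matrix dominates traffic; the input and output vectors are minor terms.
    bytes_moved = (rows * cols + cols + rows) * bytes_per_element
    return flops / bytes_moved


def roofline_fraction(achieved_speedup, roofline_speedup):
    """Fraction of the ideal (roofline) speedup that was actually achieved."""
    return achieved_speedup / roofline_speedup


# Hypothetical layer shape and precision, for illustration only.
intensity = gemv_arithmetic_intensity(rows=4096, cols=4096, bytes_per_element=2)
fraction = roofline_fraction(6.86, 7.0)
print(f"arithmetic intensity: {intensity:.3f} FLOPs/byte")
print(f"fraction of roofline: {fraction:.2%}")
```

Because the intensity stays close to 2 / bytes_per_element regardless of matrix size, GEMV throughput is set almost entirely by how fast the matrix can be read, which is why a bandwidth boost translates so directly into speedup.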

Abstract

With unprecedented demand for generative AI (GenAI) inference, acceleration of primitives that dominate GenAI, such as general matrix-vector multiplication (GEMV), is receiving considerable attention. A challenge with GEMVs is the high memory bandwidth this primitive demands. Multiple memory vendors have proposed commercially viable processing-in-memory (PIM) prototypes that attain a bandwidth boost over the processor by augmenting memory banks with compute capabilities and broadcasting the same command to all banks. While the proposed PIM designs stand to accelerate GEMV, we observe in this work that a key impediment to truly harnessing PIM acceleration is deducing the optimal data-placement of the matrix in memory banks. To this end, we tease out several factors that impact data-placement and propose the PIMnast methodology which, like a gymnast, balances these factors to identify data-placements that deliver GEMV acceleration. Across a spectrum of GenAI models, our proposed PIMnast methodology, along with additional orchestration knobs we identify, delivers up to 6.86$\times$ speedup for GEMVs (of the available 7$\times$ roofline speedup), leading to up to 5$\times$ speedup for per-token latencies.
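To illustrate why data-placement matters when one command is broadcast to all banks, here is a toy model in plain Python. It is only a sketch under stated assumptions and is not the PIMnast placement: the round-robin assignment of matrix rows to banks, the function names, and the notion of a "step" as one broadcast command are all hypothetical simplifications. The model counts how many lockstep steps are needed, which is set by the most heavily loaded bank.

```python
def gemv_reference(matrix, vector):
    """Plain GEMV: one dot product per matrix row."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def gemv_banked(matrix, vector, num_banks):
    """Toy lockstep GEMV over banks; returns (output, broadcast_steps)."""
    # Hypothetical placement: row i lives in bank i % num_banks.
    banks = [[] for _ in range(num_banks)]
    for row_index, row in enumerate(matrix):
        banks[row_index % num_banks].append((row_index, row))

    output = [0] * len(matrix)
    # Every broadcast command makes each bank process its next local row,
    # so the total step count equals the load of the fullest bank.
    steps = max(len(bank) for bank in banks)
    for step in range(steps):
        for bank in banks:
            if step < len(bank):  # banks with no remaining rows sit idle
                row_index, row = bank[step]
                output[row_index] = sum(a * x for a, x in zip(row, vector))
    return output, steps


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
vector = [1, 0, -1]
result, steps = gemv_banked(matrix, vector, num_banks=4)
print(result, steps)
```

With five rows spread over four banks, the model needs two steps even though three banks are idle in the second one. Uneven work across banks wastes the bandwidth boost in this way, which is the kind of imbalance a careful data-placement aims to avoid.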

Paper Structure

This paper contains 46 sections, 15 figures, 1 table, and 3 algorithms.

Figures (15)

  • Figure 1: PIMnast balances myriad factors to identify data-placement delivering GEMV-PIM acceleration.
  • Figure 2: (a) GenAI inference phases. (b) LPDDR-PIM overview.
  • Figure 3: (a) Baseline vs. PIM GEMV. (b) Steps in PIM GEMV.
  • Figure 4: Factors impacting data-placement.
  • Figure 5: Tackling of data-placement factors with PIMnast.
  • ...and 10 more figures