HPIM: Heterogeneous Processing-In-Memory-based Accelerator for Large Language Models Inference

Cenlin Duan, Jianlei Yang, Rubing Yang, Yikun Wang, Yiou Wang, Lingkun Long, Yingjie Qi, Xiaolin He, Ao Zhou, Xueyan Wang, Weisheng Zhao

TL;DR

HPIM tackles the memory-bound decoding bottleneck of single-batch LLM inference by coupling SRAM-PIM and HBM-PIM in a memory-centric heterogeneous PIM architecture, guided by a hardware-aware compiler. The approach maps latency-sensitive attention and nonlinear operations to SRAM-PIM and offloads weight-heavy GEMV to HBM-PIM, enabling intra-token parallelism through tight pipelining. Evaluations on OPT models show substantial gains, including a peak end-to-end speedup of $34.3\times$ over the NVIDIA A100 and clear advantages over state-of-the-art PIM systems such as IANUS and CXL-PNM. These results indicate that memory-centric heterogeneous PIM is a practical and scalable path for accelerating large-scale LLM inference.

Abstract

The deployment of large language models (LLMs) presents significant challenges due to their enormous memory footprints, low arithmetic intensity, and stringent latency requirements, particularly during the autoregressive decoding stage. Traditional compute-centric accelerators, such as GPUs, suffer from severe resource underutilization and memory bandwidth bottlenecks in these memory-bound workloads. To overcome these fundamental limitations, we propose HPIM, the first memory-centric heterogeneous Processing-In-Memory (PIM) accelerator that integrates SRAM-PIM and HBM-PIM subsystems designed specifically for LLM inference. HPIM employs a software-hardware co-design approach that combines a specialized compiler framework with a heterogeneous hardware architecture. It partitions workloads based on their characteristics: latency-critical attention operations are mapped to the SRAM-PIM subsystem to exploit its ultra-low latency and high computational flexibility, while weight-intensive GEMV computations are assigned to the HBM-PIM subsystem to leverage its high internal bandwidth and large storage capacity. Furthermore, HPIM introduces a tightly coupled pipeline strategy across the SRAM-PIM and HBM-PIM subsystems to maximize intra-token parallelism, thereby significantly mitigating the serial dependency of the autoregressive decoding stage. Comprehensive evaluations using a cycle-accurate simulator demonstrate that HPIM significantly outperforms state-of-the-art accelerators, achieving a peak speedup of up to $22.8\times$ compared to the NVIDIA A100 GPU. Moreover, HPIM exhibits superior performance over contemporary PIM-based accelerators, highlighting its potential as a highly practical and scalable solution for accelerating large-scale LLM inference.
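
The partitioning rule stated in the abstract can be illustrated with a minimal sketch. This is not the HPIM compiler: the operator names, the two operator sets, and the function `assign_subsystem` are hypothetical, chosen only to show how an operator class could be mapped to a subsystem for one OPT-style decoder layer.

```python
from enum import Enum


class Subsystem(Enum):
    SRAM_PIM = "SRAM-PIM"
    HBM_PIM = "HBM-PIM"


# Hypothetical operator classes (assumption, not taken from the paper's compiler).
# Latency-critical attention and nonlinear operators go to SRAM-PIM.
ATTENTION_AND_NONLINEAR = {"attn_score", "softmax", "attn_context", "layernorm", "relu"}
# Weight-intensive GEMV operators go to HBM-PIM, where the weights reside.
WEIGHT_GEMV = {"qkv_proj", "out_proj", "ffn_fc1", "ffn_fc2"}


def assign_subsystem(op: str) -> Subsystem:
    """Map one decode-stage operator to a PIM subsystem by its class."""
    if op in ATTENTION_AND_NONLINEAR:
        return Subsystem.SRAM_PIM
    if op in WEIGHT_GEMV:
        return Subsystem.HBM_PIM
    raise ValueError(f"unknown operator: {op}")


# Operator order of one decoder layer during decoding (illustrative).
decoder_layer = [
    "layernorm", "qkv_proj", "attn_score", "softmax", "attn_context",
    "out_proj", "layernorm", "ffn_fc1", "relu", "ffn_fc2",
]

plan = [(op, assign_subsystem(op)) for op in decoder_layer]

for op, subsystem in plan:
    print(f"{op:>12} -> {subsystem.value}")
```

In this sketch consecutive operators alternate between the two subsystems several times per layer, which is the kind of hand-off that a tightly coupled pipeline across subsystems would need to overlap.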

Paper Structure

This paper contains 21 sections, 1 equation, 13 figures, 4 tables, 1 algorithm.

Figures (13)

  • Figure 1: The diverse requirements of LLM inference, the corresponding trade-offs in "PIM pyramid", and the advantages of heterogeneous PIM architecture.
  • Figure 2: Model architecture and inference process of LLM.
  • Figure 3: Execution breakdown of OPT-13B inference on an A100 GPU. (a) Component-level breakdown across the prefill (GEMM-bound) and decode (GEMV-bound) stages. (b) Operator-level breakdown over the entire LLM inference process (e.g., GEMM, GEMV, Softmax). The results are obtained using OPT-13B with an input length of 512 tokens and an output length of 32 tokens. It shows that the overall execution is overwhelmingly dominated by the GEMV-centric decode stage ($73.8\%$), highlighting it as the primary performance bottleneck.
  • Figure 4: Roofline model of OPT-$6.7$B (Bright), OPT-$13$B (Moderate), and OPT-$30$B (Dark) operations on an A100 GPU with a sequence length of $2048$. The points plot the performance of Attention (circle) and QKV Generation (pentagram) during the prefill (orange) and decode (green) phases.
  • Figure 5: Overview of the HPIM accelerator, including its two main components: HPIM compiler and HPIM architecture. The workflow starts from user-defined model descriptions and architectural configurations, which are processed by the HPIM compiler to generate executable instructions. These instructions are then run on the HPIM architecture, which features a tightly-coupled design of an SRAM-PIM subsystem and an HBM-PIM subsystem.
  • ...and 8 more figures
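
The GEMM-bound versus GEMV-bound distinction in Figures 3 and 4 comes down to arithmetic intensity (FLOPs per byte moved). The sketch below is a simplified estimate, not the paper's methodology: it assumes FP16 operands, counts each operand as moved exactly once, uses the OPT-13B hidden size of 5120, and takes approximate public A100 figures (312 TFLOP/s dense FP16, 1555 GB/s HBM bandwidth) as parameters. The function names are illustrative.

```python
def matmul_intensity(m: int, n: int, tokens: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m x n) weight applied to `tokens` input vectors."""
    flops = 2 * m * n * tokens  # one multiply and one add per weight per token
    # Weights + input activations + output activations, each moved once.
    bytes_moved = bytes_per_elem * (m * n + n * tokens + m * tokens)
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * bandwidth)


# Assumed hardware parameters (approximate A100 figures, not from the paper).
PEAK_FLOPS = 312e12   # FLOP/s, dense FP16
BANDWIDTH = 1555e9    # bytes/s
ridge = PEAK_FLOPS / BANDWIDTH  # intensity at which the two roofs meet

hidden = 5120  # OPT-13B hidden size
gemv = matmul_intensity(hidden, hidden, tokens=1)    # decode: one token at a time
gemm = matmul_intensity(hidden, hidden, tokens=512)  # prefill: 512-token prompt

for name, ai in (("decode GEMV", gemv), ("prefill GEMM", gemm)):
    bound = "memory-bound" if ai < ridge else "compute-bound"
    frac = attainable_flops(ai, PEAK_FLOPS, BANDWIDTH) / PEAK_FLOPS
    print(f"{name}: {ai:.2f} FLOP/byte, {bound}, {frac:.1%} of peak attainable")
```

Under these assumptions the decode-stage GEMV sits at roughly one FLOP per byte, far below the ridge point, so its attainable throughput is limited by memory bandwidth rather than by compute.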