PIM-GPT: A Hybrid Process-in-Memory Accelerator for Autoregressive Transformers

Yuting Wu, Ziyu Wang, Wei D. Lu

TL;DR

PIM-GPT tackles the memory bottleneck in autoregressive GPT inference with a hybrid architecture that pairs DRAM-based process-in-memory (PIM) chips with a lightweight ASIC. Vector-matrix multiplication (VMM) runs directly in the DRAM banks, while non-linear functions and coordination tasks are offloaded to the ASIC. Dataflow and workload-mapping strategies maximize row locality and parallelism. Across eight GPT models with up to 1.4B parameters, the design achieves 41–137× speedup over GPUs and 631–1074× over CPUs, with 123–383× (GPU) and 320–602× (CPU) better energy efficiency. The hybrid design reduces off-chip data movement, avoids expensive HBM integration, and demonstrates practical, scalable end-to-end GPT acceleration for memory-bound transformer workloads.

Abstract

Decoder-only Transformer models such as GPT have demonstrated exceptional performance in text generation by autoregressively predicting the next token. However, the efficiency of running GPT on current hardware systems is bounded by a low compute-to-memory ratio and a high volume of memory accesses. Process-in-memory (PIM) architectures can minimize off-chip data movement and utilize high internal bandwidth, which makes them promising candidates for accelerating memory-bound tasks such as GPT inference. In this work, we propose a PIM accelerator, PIM-GPT, which achieves end-to-end acceleration of GPT inference with high performance and high energy efficiency. PIM-GPT leverages DRAM-based PIM designs to execute multiply-accumulate (MAC) operations directly in the DRAM chips, eliminating the need to move matrix data off-chip. Non-linear functions and data communication are supported by an application-specific integrated circuit (ASIC). At the software level, mapping schemes are designed to maximize data locality and computation parallelism by concatenating and partitioning matrices among DRAM channels and banks so that all available in-memory computation units are used. The efficiency of the PIM-GPT architecture is verified through circuit synthesis and an event-driven, clock-cycle-accurate simulator. Overall, PIM-GPT achieves 41–137× and 631–1074× speedup, and 123–383× and 320–602× higher energy efficiency, over the GPU and CPU baselines, respectively, on 8 GPT models with up to 1.4 billion parameters.
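
The idea of partitioning a matrix among channels and banks so that every in-memory MAC unit contributes to one VMM can be illustrated with a small software model. This is only a sketch under stated assumptions: the round-robin row assignment, the channel count, and all function names are hypothetical and are not taken from the paper; only the figure of 16 banks per channel comes from the Figure 5 caption.

```python
# Illustrative model only. The round-robin layout, NUM_CHANNELS, and all
# function names are assumptions, not the paper's actual mapping scheme.
NUM_CHANNELS = 2          # hypothetical channel count
BANKS_PER_CHANNEL = 16    # Figure 5 caption: 16 banks per channel


def partition_rows(num_rows, num_units):
    """Assign matrix row indices to compute units in round-robin order."""
    return [list(range(unit, num_rows, num_units)) for unit in range(num_units)]


def bank_mac(rows, vector):
    """Model one bank's MAC unit: one dot product per locally stored row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in rows]


def pim_vmm(matrix, vector):
    """Multiply matrix by vector with the rows spread over all banks."""
    num_units = NUM_CHANNELS * BANKS_PER_CHANNEL
    assignment = partition_rows(len(matrix), num_units)
    result = [0] * len(matrix)
    # In hardware the banks would work concurrently; here they run in turn.
    for row_ids in assignment:
        partial = bank_mac([matrix[i] for i in row_ids], vector)
        # Only the small per-row outputs leave the bank, not the weights.
        for i, value in zip(row_ids, partial):
            result[i] = value
    return result


def reference_vmm(matrix, vector):
    """Plain, unpartitioned product used to check the partitioned version."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]
```

In this model the weight rows never move: each bank receives the input vector and returns one scalar per stored row, which is the data-movement saving the abstract describes. A real mapping would also have to respect DRAM row boundaries and bank capacity, which this sketch ignores.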

Paper Structure

This paper contains 20 sections, 1 equation, 15 figures, 2 tables, 3 algorithms.

Figures (15)

  • Figure 1: (a) Parameter and computation cost comparisons of GPTs and ResNet-18. (b) Operation/parameter ratios of CNN and GPT models.
  • Figure 2: Transformer architectures of BERT and GPT.
  • Figure 3: PIM-GPT system overview. (a) Hardware-aware GPT model partition. (b) Compilation of computation stream to command stream. (c) A complete PIM-GPT hardware architecture.
  • Figure 4: Multi-bank mapping scheme for VMM operation. Colors represent the different rows in the original matrix.
  • Figure 5: DRAM PIM organization. (a) A channel is composed of a global buffer and 16 banks. A bank contains (b) a conventional DRAM bank and (c) a MAC unit with multipliers and an adder tree.
  • ...and 10 more figures