PIM-GPT: A Hybrid Process-in-Memory Accelerator for Autoregressive Transformers
Yuting Wu, Ziyu Wang, Wei D. Lu
TL;DR
PIM-GPT tackles the memory bottleneck in autoregressive GPT inference with a hybrid architecture that pairs DRAM-based process-in-memory (PIM) chips with a lightweight ASIC. Vector-matrix multiplication (VMM) is performed directly in the DRAM banks, while non-linear functions and data communication are handled by the ASIC; dataflow and workload-mapping strategies maximize row locality and parallelism. Across eight GPT models with up to 1.4B parameters, PIM-GPT achieves speedups of 41–137x over a GPU baseline and 631–1074x over a CPU baseline, with energy-efficiency gains of 123–383x (GPU) and 320–602x (CPU). The hybrid design reduces off-chip data movement, avoids expensive HBM integration, and demonstrates practical, scalable end-to-end GPT acceleration for memory-bound transformer workloads.
Abstract
Decoder-only Transformer models such as GPT have demonstrated exceptional performance in text generation by autoregressively predicting the next token. However, the efficiency of running GPT on current hardware systems is limited by a low compute-to-memory ratio and heavy memory access. Process-in-memory (PIM) architectures can minimize off-chip data movement and exploit high internal bandwidth, which makes them promising candidates for accelerating memory-bound tasks such as GPT inference. In this work, we propose a PIM accelerator, PIM-GPT, which achieves end-to-end acceleration of GPT inference with high performance and high energy efficiency. PIM-GPT leverages DRAM-based PIM designs to execute multiply-accumulate (MAC) operations directly in the DRAM chips, eliminating the need to move matrix data off-chip. Non-linear functions and data communication are supported by an application-specific integrated circuit (ASIC). At the software level, mapping schemes are designed to maximize data locality and computation parallelism by concatenating and partitioning matrices among DRAM channels and banks so that all available in-memory computation units are utilized. The efficiency of the PIM-GPT architecture is verified through circuit synthesis and an event-driven, clock-cycle-accurate simulator. Overall, on 8 GPT models with up to 1.4 billion parameters, PIM-GPT achieves 41$-$137$\times$ and 631$-$1074$\times$ speedup, and 123$-$383$\times$ and 320$-$602$\times$ higher energy efficiency, over the GPU and CPU baselines, respectively.
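The idea of partitioning a matrix among channels and banks so that every in-memory MAC unit stays busy can be illustrated with a small functional sketch. It is not the mapping scheme of the paper: the contiguous row-wise split, the channel and bank counts, and the names `partition_rows`, `bank_mac`, and `pim_vmm` are assumptions chosen for illustration, and timing, row-buffer behavior, and matrix concatenation are not modeled.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def partition_rows(num_rows: int, num_channels: int,
                   banks_per_channel: int) -> Dict[Tuple[int, int], range]:
    """Assign contiguous row ranges to (channel, bank) pairs as evenly as possible.

    Assumption: a simple row-wise split; the paper's actual mapping may differ.
    """
    units = num_channels * banks_per_channel
    base, extra = divmod(num_rows, units)
    mapping: Dict[Tuple[int, int], range] = {}
    start = 0
    for u in range(units):
        size = base + (1 if u < extra else 0)  # first `extra` units take one more row
        mapping[(u // banks_per_channel, u % banks_per_channel)] = range(start, start + size)
        start += size
    return mapping


def bank_mac(weights: Matrix, rows: range, x: Vector) -> List[float]:
    """Model one bank's MAC unit: dot products of its resident rows with the input."""
    return [sum(w * xi for w, xi in zip(weights[r], x)) for r in rows]


def pim_vmm(weights: Matrix, x: Vector, num_channels: int = 2,
            banks_per_channel: int = 4) -> List[float]:
    """Compute y = W x by sending x to every bank and gathering the partial outputs."""
    mapping = partition_rows(len(weights), num_channels, banks_per_channel)
    out = [0.0] * len(weights)
    # In hardware the banks would operate in parallel; here they are visited in turn.
    for rows in mapping.values():
        for r, value in zip(rows, bank_mac(weights, rows, x)):
            out[r] = value  # the gather step stands in for the ASIC collecting results
    return out
```

In this sketch only the input vector and the output elements cross the bank boundary, while the weight rows never leave the bank that stores them, which is the property the abstract attributes to in-DRAM MAC execution.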
