PIM-LLM: A High-Throughput Hybrid PIM Architecture for 1-bit LLMs
Jinendra Malekar, Peyton Chandarana, Md Hasibul Amin, Mohammed E. Elbtity, Ramtin Zand
TL;DR
PIM-LLM accelerates decoder-only LLMs with extreme 1-bit quantization using a hybrid architecture that combines analog processing-in-memory (PIM) with a digital tensor processing unit (TPU). Low-precision projection-layer MatMuls with 1-bit weights run on memristive crossbars, while high-precision 8-bit attention-head MatMuls run on a dedicated TPU that uses an output-stationary dataflow and LPDDR memory for edge efficiency. The reported results show up to roughly $80\times$ higher throughput in tokens per second and a $70\%$ increase in tokens per joule compared to conventional hardware accelerators, along with at least $2\times$ and $5\times$ gains in GOPS and GOPS/W, respectively, over prior PIM accelerators. The benefits grow with model size and context length. The hybrid approach thus enables high-throughput, energy-efficient edge inference for 1-bit LLMs and provides a blueprint for future PIM-based decoder architectures.
Abstract
In this paper, we propose PIM-LLM, a hybrid architecture developed to accelerate 1-bit large language models (LLMs). PIM-LLM leverages analog processing-in-memory (PIM) architectures and digital systolic arrays to accelerate low-precision matrix multiplication (MatMul) operations in projection layers and high-precision MatMul operations in attention heads of 1-bit LLMs, respectively. Our design achieves up to roughly 80x improvement in tokens per second and a 70% increase in tokens per joule compared to conventional hardware accelerators. Additionally, PIM-LLM outperforms previous PIM-based LLM accelerators, setting a new benchmark with at least 2x and 5x improvement in GOPS and GOPS/W, respectively.
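The split between the two kinds of MatMul can be illustrated with a minimal functional sketch in plain Python. It is not the authors' implementation and models no hardware behavior: sign binarization for the 1-bit weights, symmetric per-tensor int8 quantization for the attention operands, and all function names (`sign_binarize`, `pim_projection`, `tpu_attention_scores`) are assumptions made for illustration only.

```python
# Hypothetical sketch: route the MatMuls of one decoder block to two back ends.
# Numerical behavior only; no analog or systolic-array effects are modeled.

def sign_binarize(w):
    """Map real-valued weights to {-1, +1} (assumed 1-bit scheme)."""
    return [[1 if v >= 0 else -1 for v in row] for row in w]

def pim_projection(x, w_1bit):
    """Stand-in for the analog PIM path: activations times 1-bit weights.

    With +/-1 weights each product reduces to an add or a subtract.
    """
    cols = list(zip(*w_1bit))
    return [[sum(a if w > 0 else -a for a, w in zip(row, col)) for col in cols]
            for row in x]

def quantize_int8(m):
    """Symmetric per-tensor int8 quantization (assumed); returns (ints, scale)."""
    peak = max(abs(v) for row in m for v in row) or 1.0
    scale = peak / 127.0
    return [[round(v / scale) for v in row] for row in m], scale

def tpu_attention_scores(q, k):
    """Stand-in for the digital 8-bit path: int8 Q @ K^T, then rescale."""
    q8, sq = quantize_int8(q)
    k8, sk = quantize_int8(k)
    acc = [[sum(a * b for a, b in zip(qr, kr)) for kr in k8] for qr in q8]
    return [[v * sq * sk for v in row] for row in acc]

# Toy usage: two tokens, hidden size 2.
x = [[1.0, -2.0], [0.5, 3.0]]
w_q = sign_binarize([[0.3, -0.7], [-0.1, 0.9]])  # projection weights, 1-bit
q = pim_projection(x, w_q)                       # low-precision path
scores = tpu_attention_scores(q, q)              # high-precision path (K = Q here)
```

In this toy setup the projection path never multiplies, which is the property that makes 1-bit weights a natural fit for crossbar-style accumulation, while the attention scores depend on two activation matrices and therefore keep 8-bit operands on both sides.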
