PIM-LLM: A High-Throughput Hybrid PIM Architecture for 1-bit LLMs

Jinendra Malekar, Peyton Chandarana, Md Hasibul Amin, Mohammed E. Elbtity, Ramtin Zand

TL;DR

PIM-LLM addresses the challenge of accelerating decoder-only LLMs with extreme 1-bit quantization by proposing a hybrid architecture that combines analog PIM with a digital tensor processing unit (TPU). It accelerates low-precision projection-layer MatMuls with 1-bit weights on memristive crossbars, while running high-precision 8-bit attention-head MatMuls on a dedicated TPU, using an output-stationary dataflow and LPDDR memory for edge efficiency. The results show up to roughly $80\times$ higher tokens-per-second throughput and a $70\%$ increase in tokens per joule over conventional hardware accelerators, plus at least $2\times$ and $5\times$ gains in GOPS and GOPS/W, respectively, over prior PIM accelerators, with stronger benefits for larger models and longer contexts. This hybrid approach thus enables high-throughput, energy-efficient edge inference for 1-bit LLMs and provides a blueprint for future PIM-based decoder architectures.

Abstract

In this paper, we propose PIM-LLM, a hybrid architecture developed to accelerate 1-bit large language models (LLMs). PIM-LLM leverages analog processing-in-memory (PIM) architectures and digital systolic arrays to accelerate low-precision matrix multiplication (MatMul) operations in projection layers and high-precision MatMul operations in attention heads of 1-bit LLMs, respectively. Our design achieves up to roughly 80x improvement in tokens per second and a 70% increase in tokens per joule compared to conventional hardware accelerators. Additionally, PIM-LLM outperforms previous PIM-based LLM accelerators, setting a new benchmark with at least 2x and 5x improvement in GOPS and GOPS/W, respectively.
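
The split between projection-layer and attention-head MatMuls can be illustrated with a rough operation count. The sketch below is not the paper's accounting: it assumes a standard decoder layer with four $d \times d$ projections (query, key, value, output) and a two-layer feed-forward block, counts multiply-accumulates per generated token, and ignores softmax, normalization, and embeddings. The function name and the dimensions in the usage note are hypothetical.

```python
def matmul_mac_split(d_model: int, d_ff: int, context_len: int) -> dict:
    """Multiply-accumulates per generated token in one decoder layer.

    Assumed accounting (not taken from the paper): only MatMuls are counted.
    """
    # Projection layers: Q, K, V, and output projections plus two feed-forward layers.
    projection = 4 * d_model * d_model + 2 * d_model * d_ff
    # Attention heads: Q.K^T and scores.V over context_len cached tokens,
    # summed over all heads (heads * context_len * d_model / heads each).
    attention = 2 * context_len * d_model
    total = projection + attention
    return {
        "projection": projection,
        "attention": attention,
        "low_precision_fraction": projection / total,
    }


if __name__ == "__main__":
    # Illustrative dimensions only; not tied to a specific model in the paper.
    for context_len in (128, 2048):
        split = matmul_mac_split(d_model=768, d_ff=3072, context_len=context_len)
        print(context_len, round(split["low_precision_fraction"], 4))
```

Under these assumptions the low-precision share depends only on the ratio of the projection widths to the context length, which is why both model size and context length matter when deciding how much work lands on the PIM side versus the systolic array.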

Paper Structure

This paper contains 13 sections, 2 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: (a) The 1-bit LLMs divide the model into two portions: attention heads with high-precision MatMul operations (shown in red) and projection layers with low-precision MatMuls (shown in green). (b) The percentage of the low-precision MatMul operations in various OPT models.
  • Figure 2: The architecture of decoder-only LLMs. The tokenization and embedding layers are not shown in the figure.
  • Figure 3: The proposed PIM-LLM architecture. (a) The LLM-specific TPU architecture. (b) The PIM architecture with multiple banks. (c) The PIM tile consists of a network of PEs. (d) The PEs include memristive crossbars to perform MVM operations.
  • Figure 4: Total cycles required for executing various LLMs using $32\times32$ systolic arrays with different dataflow architectures.
  • Figure 5: Tokens per second result for various LLMs with different context lengths ($l$).
  • ...and 3 more figures
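
The output-stationary dataflow mentioned in the TL;DR and compared in Figure 4 can be sketched functionally. The code below is a minimal illustration, assuming a square array of `array_dim` by `array_dim` processing elements where each PE owns one output element and accumulates it in place while operands stream past. It models neither input skew nor timing, so it does not reproduce the cycle counts of Figure 4; the function name and the tile counter are hypothetical.

```python
def output_stationary_matmul(a, b, array_dim=32):
    """Compute a @ b tile by tile, keeping each output in one PE accumulator.

    Returns the product and the number of output tiles mapped onto the array.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    out = [[0] * n for _ in range(m)]
    tiles = 0
    for row0 in range(0, m, array_dim):
        for col0 in range(0, n, array_dim):
            rows = range(row0, min(row0 + array_dim, m))
            cols = range(col0, min(col0 + array_dim, n))
            # One accumulator per PE; it stays in place for the whole reduction.
            acc = {(i, j): 0 for i in rows for j in cols}
            # Operands stream through the array, one reduction index per step.
            for step in range(k):
                for i in rows:
                    for j in cols:
                        acc[(i, j)] += a[i][step] * b[step][j]
            # Outputs are drained only after the reduction is complete.
            for (i, j), value in acc.items():
                out[i][j] = value
            tiles += 1
    return out, tiles


if __name__ == "__main__":
    product, tiles = output_stationary_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, tiles)
```

The point of the design is that partial sums never leave the PE until they are final, so only inputs and weights move during the reduction.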