AttentionLego: An Open-Source Building Block For Spatially-Scalable Large Language Model Accelerator With Processing-In-Memory Technology

Rongqing Cong, Wenyang He, Mingxuan Li, Bangning Luo, Zebin Yang, Yuchao Yang, Ru Huang, Bonan Yan

TL;DR

This work tackles the data movement and compute demands of self-attention in Transformer-based LLMs by introducing AttentionLego, a vanilla self-attention accelerator implemented with Processing-In-Memory (PIM) to enable spatially scalable LLM processors. The design decomposes the attention computation into five modular blocks (Input Process, Score, Softmax, DMA, and Top Controller) built around a PIM-based matrix-vector multiply and a look-up-table (LUT) based Softmax to reduce I/O bottlenecks and improve energy efficiency. Key contributions include a tileable architecture with 32 APIM modules storing 128×128 weight matrices, a compute-in-memory (CIM) computation path for Q/K/V, a scalable Score engine that forms a 128×2048 QK^T, a LUT-driven Softmax, and a DMA-driven data flow tuned for on/off-chip bandwidth. The matrix multiply can complete in 64 clock cycles, enabling token-level inference pipelines. By loading parameters once and exploiting weight-stationary dataflow on PIM macros, AttentionLego offers a practical building block for spatially expandable LLM accelerators and provides open-source code to accelerate hardware exploration and IoT-oriented AI deployments.

Abstract

Large language models (LLMs) with Transformer architectures have achieved remarkable success in natural language processing, multimodal generative artificial intelligence, and agent-oriented artificial intelligence. The self-attention module is the dominant sub-structure inside Transformer-based LLMs. Computation on general-purpose graphics processing units (GPUs) places heavy demand on I/O bandwidth for transferring intermediate calculation results between memories and processing units. To tackle this challenge, this work develops a fully customized vanilla self-attention accelerator, AttentionLego, as the basic building block for constructing spatially expandable LLM processors. AttentionLego provides a basic implementation in fully customized digital logic incorporating Processing-In-Memory (PIM) technology. It is based on PIM-based matrix-vector multiplication and a look-up table-based Softmax design. The open-source code is available online: https://bonany.cc/attentionleg.
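The two ingredients named in the abstract, matrix-vector multiplication against stationary weights and a look-up table-based Softmax, can be illustrated with a small software sketch of one token-level attention step. This is a plain-Python model for intuition only, not the authors' hardware design: the table size, the step `SCALE`, and the function names `matvec`, `lut_softmax`, and `attend` are assumptions introduced here.

```python
import math

# Hypothetical parameters for illustration; the actual LUT depth, bit widths,
# and scaling used by AttentionLego are not stated in this summary.
LUT_SIZE = 256
SCALE = 1 / 16.0  # assumed spacing between adjacent LUT entries

# Entry i holds exp(-i * SCALE). Scores are shifted so the maximum maps to index 0.
EXP_LUT = [math.exp(-i * SCALE) for i in range(LUT_SIZE)]


def matvec(weights, x):
    """Matrix-vector product; stands in for one weight-stationary PIM macro."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def lut_softmax(scores):
    """Approximate softmax using table lookups instead of evaluating exp()."""
    top = max(scores)
    # Quantize the distance from the maximum to a table index, clamped to the table.
    idx = [min(round((top - s) / SCALE), LUT_SIZE - 1) for s in scores]
    exps = [EXP_LUT[i] for i in idx]
    total = sum(exps)
    return [e / total for e in exps]


def attend(x, w_q, w_k, w_v, k_cache, v_cache):
    """One token step: project x, extend the K/V caches, return the attention output."""
    q, k, v = matvec(w_q, x), matvec(w_k, x), matvec(w_v, x)
    k_cache.append(k)
    v_cache.append(v)
    d = len(q)
    # One score per cached key, scaled by sqrt(d) as in standard attention.
    scores = [sum(a * b for a, b in zip(q, kj)) / math.sqrt(d) for kj in k_cache]
    probs = lut_softmax(scores)
    # Weighted sum of cached value vectors.
    return [sum(p * vj[i] for p, vj in zip(probs, v_cache)) for i in range(d)]
```

In this sketch the weight matrices are passed in once and reused for every token, which mirrors the weight-stationary idea: only the input vector and the growing K/V caches change between steps. The lookup replaces the exponential with a quantized approximation, so results differ slightly from an exact softmax unless the score differences fall on multiples of `SCALE`.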

Paper Structure (12 sections, 2 equations, 11 figures, 1 table)

Figures (11)

  • Figure 1: Operation number breakdown for popular large language models. The self-attention module dominates the operation counts in LLMs. One Multiply-Accumulate (MAC) counts as 2 operations. We unify the operation counts for floating-point numbers and integers.
  • Figure 2: Processing-in-memory macro to perform in situ general matrix-vector multiplication.
  • Figure 3: (a) Basic block diagram and calculations for the self-attention module. (b) LLaMA 2 model architecture diagram (Touvron et al., 2023).
  • Figure 4: Core idea and the spatial scalability of AttentionLego.
  • Figure 5: Architecture of AttentionLego.
  • ...and 6 more figures