Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

Nathan Leroux, Paul-Philipp Manea, Chirag Sudarshan, Jan Finkbeiner, Sebastian Siegel, John Paul Strachan, Emre Neftci

TL;DR

This work presents a custom in-memory computing architecture for self-attention based on gain cells, an emerging charge-based memory. Gain cells can be written efficiently to store new tokens during sequence generation, and they enable the parallel analog dot-product computation that self-attention requires.

Abstract

Transformer networks, driven by self-attention, are central to Large Language Models. In generative Transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks. We present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable parallel analog dot-product computation required for self-attention. However, the analog gain cell circuits introduce non-idealities and constraints preventing the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm achieving text processing performance comparable to GPT-2 without training from scratch. Compared to GPUs, our architecture reduces attention latency by up to two orders of magnitude and energy consumption by up to five orders of magnitude, marking a significant step toward ultra-fast, low-power generative Transformers.
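The caching idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the projection matrices `Wq`, `Wk`, `Wv`, the helper `attend`, and the standard softmax attention are all assumptions chosen for illustration. The sketch appends one key/value projection per generated token and checks that the cached result matches a full recomputation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    """Single-query attention: q has shape (d,), K and V have shape (t, d)."""
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d, steps = 8, 5
# Hypothetical projection matrices and token embeddings (random stand-ins).
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((steps, d))

# Generation loop: only the newest token is projected at each step;
# earlier key/value projections are read back from the cache.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
cached_outputs = []
for x in tokens:
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    cached_outputs.append(attend(x @ Wq, K_cache, V_cache))

# Reference: recompute every projection from scratch for the last step.
reference = attend(tokens[-1] @ Wq, tokens @ Wk, tokens @ Wv)
print(np.allclose(cached_outputs[-1], reference))
```

The cache grows by one row per token, so every generation step has to read all stored projections; this repeated memory traffic is the bottleneck the abstract attributes to GPU-based attention.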

Paper Structure

This paper contains 18 sections, 9 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: (a) Multi-head attention mechanism. (b) Hardware architecture of the attention inference accelerator. (c) Circuit diagram of a gain cell-based signed weight multiplier cell used in the attention computation. (d) Diagram of the ReLU charge-to-pulse circuit. (e) Output current of the gain cell-based signed weight multiplier for different weight voltages $V_{store}$ at $V_{in}=0.9$ V. CMOS process variations are represented by the green shaded area. (f) Silicon CMOS capacitor voltage decay over time due to charge leakage. (g) Behavior of the ReLU charge-to-pulse circuit. (h) Behavior of the signed charge-to-pulse circuit.
  • Figure 2: (a) Three inference steps of a dot-product between $Q$ and $K$ in Sliding Window Attention. The grey boxes represent tokens that are attended to, and the blank boxes the masked tokens. (b) Equivalent gain cell array implementations for an entire attention head. A new column of $K$ and $V$ is written before each inference step. (c) Proposed pipeline, highlighting parallel operations of writing new $K$-$V$ pairs and performing the MAC operations. (d) Transient simulation of the $\Phi(Q \cdot K^T)$ MAC operation including temporal location. (e) Transient simulation results of the $\Phi(S) \cdot V$ MAC operation including the pulse and sign signal for the counter within the pipeline.
  • Figure 3: (a) Proposed hardware architecture for a single attention head, featuring the tiling of the attention head into multiple sub-tiles. The digital peripheral control is highlighted in yellow. (b) Floorplan of the architecture for one attention layer using OSFET technology assumptions. (c) Floorplan of one sub-tile. (d) Routing of one sub-tile highlighting the vertical propagation of signals by diagonal wire tapping.
  • Figure 4: (a) Pre-trained model mapping. From a software pre-trained model, we fine-tune an intermediate model that integrates all hardware constraints except dot-product nonlinearity. Then, we use a custom adaptation algorithm to map the intermediate model to the gain cell's nonlinearity. Finally, we fine-tune the nonlinear model. (b) Sketch of the adaptation algorithm for scaling factors. Scaling factors rescale the input before clipping and quantization. The nonlinear model leads to different statistics (red histogram) from the linear model (green histogram). The adaptation algorithm modifies the scaling factors to match the statistics of the nonlinear model to the statistics of the linear one. (c) Evolution of perplexity (lower is better) during the adaptation algorithm. (d) Training curves for the different models. The software model is GPT-2, the nonlinear model is the model with the proposed hardware attention, and the linear model is the hardware attention with ideal linear gain cells.
  • Figure 5: (a) Comparison of expected model results versus SPICE simulation results for the $\Phi(Q \cdot K^T)$ operation. (b) Comparison of PyTorch model versus SPICE simulation results for the $\Phi(S) \cdot V$ operation. (c) Latency of the attention mechanism for one processed token and (d) energy consumption for a twelve-head attention mechanism implemented by a consumer GPU, an embedded application-specific GPU, and our hardware architecture. (e) Energy consumption ratio for the different modules of our hardware architecture, including analog and digital signals.
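
The sliding-window scheme described in the Figure 2 caption, where a new column of $K$ and $V$ is written before each inference step, can be sketched in software as a fixed-size buffer whose oldest slot is overwritten. This is only an illustrative sketch under stated assumptions: the class `SlidingWindowCache`, the use of ReLU as a stand-in for $\Phi$, and the absence of any scaling, clipping, or quantization are hypothetical choices, not details taken from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class SlidingWindowCache:
    """Fixed-size K/V store (hypothetical); the oldest slot is overwritten."""

    def __init__(self, window, d):
        self.K = np.zeros((window, d))
        self.V = np.zeros((window, d))
        self.valid = np.zeros(window, dtype=bool)  # slots written so far
        self.window = window
        self.pos = 0

    def write(self, k, v):
        # Write the new K/V pair into the slot holding the oldest token.
        slot = self.pos % self.window
        self.K[slot], self.V[slot] = k, v
        self.valid[slot] = True
        self.pos += 1

    def attend(self, q):
        # ReLU is an assumed stand-in for the activation Phi.
        scores = relu(self.K @ q)
        scores = np.where(self.valid, scores, 0.0)  # mask unwritten slots
        return scores @ self.V

rng = np.random.default_rng(0)
d, window, steps = 4, 3, 6
Q, K, V = (rng.standard_normal((steps, d)) for _ in range(3))

cache = SlidingWindowCache(window, d)
outputs = []
for t in range(steps):
    cache.write(K[t], V[t])          # write before the inference step
    outputs.append(cache.attend(Q[t]))
outputs = np.array(outputs)

def dense_reference(t):
    # Same computation with an explicit window over the full K and V.
    lo = max(0, t - window + 1)
    return relu(K[lo:t + 1] @ Q[t]) @ V[lo:t + 1]

print(all(np.allclose(outputs[t], dense_reference(t)) for t in range(steps)))
```

Because the output is a weighted sum over the stored rows, the physical order of the slots does not matter in this sketch, which is why overwriting in place gives the same result as sliding an explicit window.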