Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation

Junyoung Park; Myeonggu Kang; Yunki Han; Yanggon Kim; Jaekang Shin; Lee-Sup Kim

TL;DR

This work tackles the memory bottleneck of attention in autoregressive text generation by introducing probability-estimation pruning that operates before softmax to remove tokens with near-zero probabilities, achieving a $12.1\times$ average pruning ratio without retraining. It couples this with an out-of-order score calculation and a specialized hardware design, ToPick, that streams KV chunks on demand and computes partial scores to minimize off-chip memory transfers. The approach yields substantial results: up to $2.57\times$ overall off-chip memory reduction, $2.28\times$ speedup, and $2.41\times$ energy efficiency in generation, with detailed hardware evaluation showing modest area/power overhead. Together, the probability-estimation method and the ToPick architecture offer practical, instance-adaptive pruning and memory-efficient self-attention for large language model generation.

Abstract

The attention mechanism in text generation is memory-bound due to its sequential characteristics. Therefore, off-chip memory accesses should be minimized for faster execution. Although previous methods addressed this by pruning unimportant tokens, they fall short in selectively removing tokens with near-zero attention probabilities in each instance. Our method estimates the probability before the softmax function, effectively removing low-probability tokens and achieving a 12.1x pruning ratio without fine-tuning. Additionally, we present a hardware design supporting seamless on-demand off-chip access. Our approach reduces memory accesses by 2.6x, leading to an average 2.3x speedup and 2.4x higher energy efficiency.
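To make the idea of pruning before softmax concrete, here is a minimal Python sketch. It is not the estimator defined in the paper. It assumes one simple conservative bound: the softmax denominator accumulated over the tokens seen so far can only grow, so dividing a token's exponentiated score by that partial denominator gives an upper bound on its final probability. The function name `prune_low_probability_tokens` and the `threshold` parameter are hypothetical.

```python
import math


def prune_low_probability_tokens(scores, threshold=1e-3):
    """Return indices of tokens kept after conservative pre-softmax pruning.

    Assumption of this sketch (not taken from the paper): a token is dropped
    only when an upper bound on its softmax probability is below `threshold`.
    """
    kept = []
    running_max = -math.inf  # largest score seen so far, for numerical stability
    running_sum = 0.0        # sum of exp(s_j - running_max) over seen tokens
    for index, score in enumerate(scores):
        new_max = max(running_max, score)
        # Rescale the partial denominator to the new maximum, then add this token.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + math.exp(score - new_max))
        running_max = new_max
        # The partial denominator never exceeds the full one, so this ratio
        # can only overestimate the token's true probability.
        upper_bound = math.exp(score - running_max) / running_sum
        if upper_bound >= threshold:
            kept.append(index)
    return kept


print(prune_low_probability_tokens([4.0, -6.0, 3.5, -8.0, 0.0]))  # [0, 2, 4]
```

Because the bound never underestimates the true probability, every token dropped here is guaranteed to fall below the threshold after softmax. The cost is that some low-probability tokens seen early, before the denominator has grown, are kept.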

Paper Structure (23 sections, 3 equations, 10 figures, 2 tables)

Introduction
Background & Motivation
Autoregressive transformer model
Transformer architecture
KV caching
Motivation
Memory transfer overhead
Distribution-aligned pruning
Proposed Work
Probability Estimation
Out-of-order Score Calculation
ToPick Architecture
Microarchitecture of PE Lane
Experiments
Experimental Setup
...and 8 more sections

Figures (10)

Figure 1: Transformer-based autoregressive text generation.
Figure 2: Memory transfer breakdown.
Figure 3: Various attention score distributions.
Figure 4: (a) Heatmap of attention probability across token indices in text generation, where the middle column aggregates the probabilities of tokens 1 to t-10. (b) Margins around the partial score within which the true result lies. $s^b$ denotes the partial score at chunk index $b$; $M^b_{min}$ and $M^b_{max}$ denote the margins for the minimum and maximum values, respectively.
Figure 5: Out-of-order score calculation.
...and 5 more figures
