Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation

Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim

TL;DR

This work tackles the memory bottleneck of attention in autoregressive text generation by introducing probability-estimation pruning that operates before softmax to remove tokens with near-zero probabilities, achieving a $12.1\times$ average pruning ratio without retraining. It couples this with an out-of-order score calculation and a specialized hardware design, ToPick, that streams KV chunks on demand and computes partial scores to minimize off-chip memory transfers. The approach yields substantial results: up to $2.57\times$ overall off-chip memory reduction, $2.28\times$ speedup, and $2.41\times$ energy efficiency in generation, with detailed hardware evaluation showing modest area/power overhead. Together, the probability-estimation method and the ToPick architecture offer practical, instance-adaptive pruning and memory-efficient self-attention for large language model generation.

Abstract

The attention mechanism in text generation is memory-bound due to its sequential characteristics. Therefore, off-chip memory accesses should be minimized for faster execution. Although previous methods addressed this by pruning unimportant tokens, they fall short in selectively removing tokens with near-zero attention probabilities in each instance. Our method estimates the probability before the softmax function, effectively removing low-probability tokens and achieving a 12.1x pruning ratio without fine-tuning. Additionally, we present a hardware design supporting seamless on-demand off-chip access. Our approach reduces memory accesses by 2.6x, leading to an average 2.3x speedup and 2.4x higher energy efficiency.
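The idea of estimating probabilities before softmax can be illustrated with a minimal sketch. It is not the paper's estimator. It assumes each token's true attention score is only known to lie in an interval `[score_lo[i], score_hi[i]]` (for example, from a partial score plus margins), and it prunes a token when even an upper bound on its softmax probability falls below a threshold. The function names, the margin of 0.5, and the threshold of 1e-3 are hypothetical choices made for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def keep_mask(score_lo, score_hi, threshold):
    """Return True for tokens that must be kept, False for prunable ones.

    Assumption: score_lo[i] <= true score s_i <= score_hi[i].
    Then p_i = exp(s_i) / sum_j exp(s_j) <= exp(score_hi[i]) / sum_j exp(score_lo[j]),
    so a token whose upper bound is below `threshold` is safe to drop.
    """
    log_denominator = logsumexp(score_lo)  # lower bound on the softmax denominator (log domain)
    log_threshold = math.log(threshold)
    return [hi - log_denominator >= log_threshold for hi in score_hi]


# Hypothetical example: four tokens whose scores are known to within +/- 0.5.
true_scores = [4.0, 0.0, -6.0, 3.5]
margin = 0.5
score_lo = [s - margin for s in true_scores]
score_hi = [s + margin for s in true_scores]

mask = keep_mask(score_lo, score_hi, threshold=1e-3)

# Exact probabilities, computed only to check that pruned tokens were negligible.
log_z = logsumexp(true_scores)
true_probs = [math.exp(s - log_z) for s in true_scores]
```

Because the bound is conservative, a token is dropped only when its exact probability is guaranteed to be below the threshold. A tighter estimate would prune more tokens at the same threshold.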

Paper Structure

This paper contains 23 sections, 3 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Transformer-based autoregressive text generation.
  • Figure 2: Memory transfer breakdown.
  • Figure 3: Various attention score distributions.
  • Figure 4: (a) Heatmap of attention probability across token indices in text generation, where the middle column aggregates probabilities for tokens from 1 to t-10. (b) Margins around the partial score within which the true result lies. $s^b$ indicates the partial score of chunk index $b$. $M^b_{min}$ and $M^b_{max}$ denote the margins for the minimum and maximum values, respectively.
  • Figure 5: Out-of-Order Score Calculation.
  • ...and 5 more figures