Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation
Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim
TL;DR
This work tackles the memory bottleneck of attention in autoregressive text generation by introducing probability-estimation pruning, which operates before softmax to remove tokens with near-zero attention probabilities and achieves a $12.1\times$ average pruning ratio without retraining. It couples this with out-of-order score calculation and a specialized hardware design, ToPick, which streams KV chunks on demand and computes partial scores to minimize off-chip memory transfers. In generation, the approach achieves up to $2.57\times$ lower overall off-chip memory traffic, a $2.28\times$ speedup, and a $2.41\times$ improvement in energy efficiency, and the hardware evaluation shows modest area and power overhead. Together, the probability-estimation method and the ToPick architecture provide instance-adaptive pruning and memory-efficient self-attention for large language model generation.
Abstract
The attention mechanism in text generation is memory-bound due to its sequential characteristics. Therefore, off-chip memory accesses should be minimized for faster execution. Although previous methods addressed this by pruning unimportant tokens, they fall short in selectively removing tokens with near-zero attention probabilities in each instance. Our method estimates the probability before the softmax function, effectively removing low-probability tokens and achieving a 12.1x pruning ratio without fine-tuning. Additionally, we present a hardware design supporting seamless on-demand off-chip access. Our approach reduces memory accesses by 2.6x, leading to an average 2.3x speedup and 2.4x higher energy efficiency.
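The idea of pruning before softmax can be illustrated with a minimal sketch. It is not the authors' estimator and does not model ToPick's chunked, out-of-order score calculation; it only relies on the fact that a softmax denominator computed over a subset of tokens can never exceed the full denominator, so dividing by it gives a safe upper bound on each token's probability. The function name `prune_mask`, the parameters `n_seed` and `threshold`, and the choice of the most recent tokens as the subset are all assumptions made for illustration.

```python
import numpy as np

def prune_mask(scores, n_seed=4, threshold=1e-3):
    """Return a boolean keep-mask over tokens (hypothetical helper).

    A token is dropped only if an upper bound on its softmax
    probability is already below `threshold`.
    """
    scores = np.asarray(scores, dtype=np.float64)

    # Assumption: the most recent n_seed tokens form the subset used
    # for the partial denominator.
    seed = scores[-n_seed:]
    shift = seed.max()  # subtracted for numerical stability

    # log of the partial denominator: log(sum_{j in seed} exp(s_j - shift))
    log_partial_denominator = np.log(np.exp(seed - shift).sum())

    # The partial denominator is <= the full one, so this is an upper
    # bound on log p_i for every token i.
    log_upper_bound = scores - shift - log_partial_denominator

    keep = log_upper_bound >= np.log(threshold)
    keep[-n_seed:] = True  # always keep the seed tokens themselves
    return keep

# Example: one clearly negligible token among six attention scores.
scores = [-10.0, 4.0, 5.0, 5.1, 4.9, 5.0]
keep = prune_mask(scores)
print(keep)
```

Because the bound is conservative, every token this sketch drops has a true softmax probability below the threshold; the cost is that some low-probability tokens may still be kept when the subset's denominator is small.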
