Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers
Rui Bu, Haofeng Zhong, Wenzheng Chen, Yangyan Li
TL;DR
This paper tackles extreme-token phenomena in Transformers by diagnosing a mutual reinforcement cycle between attention sinks and value-state drains. It introduces Value-State Gated Attention (VGA), a lightweight, reactive gate derived from value states that directly modulates each token's contribution to the attention output, breaking the pathological feedback loop. The authors show analytically that VGA severs gradient flow to the offending value states when sinks form, and they validate the approach with synthetic and real-model experiments, demonstrating improved stability, interpretability, and post-training quantization robustness. VGA consistently outperforms baselines across a range of models and tasks, and it can be retrofitted to existing models for substantial stability gains at low additional compute, making it a practical enhancement for future large-scale Transformers.
Abstract
Large models based on the Transformer architecture are susceptible to extreme-token phenomena, such as attention sinks and value-state drains. These issues, which degrade model performance, quantization fidelity, and interpretability, arise from a problematic mutual reinforcement mechanism in which the model learns an inefficient 'no-op' behavior by focusing attention on tokens with near-zero value states. In this paper, we propose Value-State Gated Attention (VGA), a simple, dedicated, and stable architectural mechanism that performs 'no-op' attention efficiently by directly breaking this cycle. VGA introduces a learnable, data-dependent gate, computed directly from the value vectors (V), to modulate the output. Through a theoretical analysis of the underlying gradients, we show that gating the value state with a function of itself is more effective at decoupling value and attention score updates than prior methods that gate on input embeddings. This self-gating creates a direct regulatory pathway that allows the model to suppress a token's contribution based on its emergent value representation. Our experiments demonstrate that VGA significantly mitigates the formation of attention sinks and stabilizes value-state norms, leading to improved performance, robust quantization fidelity, and enhanced model interpretability.
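The gating idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact gate formula, so everything specific here is an assumption made for illustration: a scalar sigmoid gate per token, g_j = sigmoid(w_gate · v_j + b_gate), a single non-causal head, and the hypothetical names value_gated_attention, w_gate, and b_gate. It is not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Sigmoid that avoids overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def value_gated_attention(Q, K, V, w_gate, b_gate):
    """Single-head, non-causal attention with a gate computed from V.

    Assumed form (illustrative, not taken from the paper):
        g_j   = sigmoid(w_gate . v_j + b_gate)
        out_i = sum_j softmax(q_i . k_j / sqrt(d))_j * g_j * v_j
    Q, K, V are lists of equal-length float vectors; w_gate has the
    same length as a value vector and b_gate is a float.
    """
    d = len(Q[0])
    # One scalar gate per token, depending only on that token's value vector.
    gates = [sigmoid(dot(w_gate, v) + b_gate) for v in V]
    outputs, weights = [], []
    for q in Q:
        scores = [dot(q, k) / math.sqrt(d) for k in K]
        attn = softmax(scores)  # attention weights still sum to 1
        # The gate scales each token's contribution to the output,
        # so a token can attract attention yet contribute almost nothing.
        out = [
            sum(a * g * v[c] for a, g, v in zip(attn, gates, V))
            for c in range(len(V[0]))
        ]
        outputs.append(out)
        weights.append(attn)
    return outputs, weights, gates


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0]]
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[2.0, 0.0], [0.0, 4.0]]
    # Hypothetical gate parameters: open for token 0, nearly closed for token 1.
    out, attn, gates = value_gated_attention(Q, K, V, [10.0, 0.0], -5.0)
    print("gates:", gates)
    print("outputs:", out)
```

In this sketch the softmax is left untouched, so the attention weights still sum to one, while a closed gate removes a token's contribution to the output. That gives the model a separate way to express a 'no-op' than concentrating attention on a token with a near-zero value state.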
