Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers

Rui Bu, Haofeng Zhong, Wenzheng Chen, Yangyan Li

TL;DR

This paper tackles extreme-token phenomena in Transformers by diagnosing a mutual reinforcement cycle between attention sinks and value-state drains. It introduces Value-State Gated Attention (VGA), a lightweight, reactive gate derived from value states that directly modulates token contributions to attention outputs, breaking the pathological feedback loop. The authors show analytically that VGA severs gradient flow to offending value states when sinks form, and they validate the approach with synthetic and real-model experiments, demonstrating improved stability, interpretability, and post-training quantization robustness. VGA consistently outperforms baselines across a range of models and tasks, and can retrofit existing models with substantial stability gains at low additional compute, making it a practical enhancement for future large-scale transformers.

Abstract

Large models based on the Transformer architecture are susceptible to extreme-token phenomena, such as attention sinks and value-state drains. These issues, which degrade model performance, quantization fidelity, and interpretability, arise from a problematic mutual reinforcement mechanism where the model learns an inefficient 'no-op' behavior by focusing attention on tokens with near-zero value states. In this paper, we propose Value-State Gated Attention (VGA), a simple, dedicated, and stable architectural mechanism for performing 'no-op' attention efficiently by directly breaking this cycle. VGA introduces a learnable, data-dependent gate, computed directly from the value vectors (V), to modulate the output. Through a theoretical analysis of the underlying gradients, we show that gating the value-state with a function of itself is more effective at decoupling value and attention score updates than prior methods that gate on input embeddings. This creates a direct regulatory pathway that allows the model to suppress a token's contribution based on its emergent value representation. Our experiments demonstrate that VGA significantly mitigates the formation of attention sinks and stabilizes value-state norms, leading to improved performance, robust quantization fidelity, and enhanced model interpretability.
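
The gating idea described in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes one plausible form of the gate: a scalar per token, computed as a sigmoid of a linear projection of that token's own value vector, and applied to the token's contribution before the attention-weighted sum. The names `value_gated_attention`, `w_gate`, and `b_gate` are hypothetical, the attention is single-head and non-causal for brevity, and the paper's actual parameterization may differ (for example, per-head or per-dimension gates).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def value_gated_attention(Q, K, V, w_gate, b_gate):
    """Single-head attention with an ASSUMED value-state gate.

    Q, K, V are lists of row vectors. Each token j gets a scalar gate
    g_j = sigmoid(w_gate . V_j + b_gate), i.e. a function of its own
    value vector, and contributes g_j * V_j to the output.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    # One gate per key/value token, derived from that token's value state.
    gates = [sigmoid(dot(v, w_gate) + b_gate) for v in V]
    outputs = []
    for q in Q:
        scores = [dot(q, k) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Attention-weighted sum of gated value vectors.
        out = [
            sum(a * g * v[c] for a, g, v in zip(weights, gates, V))
            for c in range(d_v)
        ]
        outputs.append(out)
    return outputs, gates


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0]]
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0]]
    out, gates = value_gated_attention(Q, K, V, w_gate=[0.5, -0.5], b_gate=0.0)
    print("gates:", gates)
    print("outputs:", out)
```

In this sketch, a token whose gate is close to zero contributes almost nothing to the output no matter how much attention it receives, which is the 'no-op' behavior the abstract describes; with all gates close to one, the function reduces to ordinary softmax attention.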

Paper Structure

This paper contains 15 sections, 19 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Architecture of Value-State Gated Attention (VGA). Unlike vanilla attention or input-state gated attention, VGA introduces a value-state gating mechanism to modulate the attention output.
  • Figure 2: The mutual reinforcement cycle that leads to attention sinks and value-state drains. (Left) Initial state with natural attention weights and moderate value state norms. (Right) The cycle begins when a query allocates high attention to a sink token $s$. This amplifies the gradient backpropagated to $V_s$, prompting the optimizer to suppress its norm via learning rate $\eta$, resulting in a value-state drain. This suppression makes the token an even safer target for future 'no-op' queries, locking it into the sink role.
  • Figure 3: VGA alters gradient dynamics by learning to close the gate for an attention sink ($g_s \to 0$). This action severs the gradient flow to the value state $V_s$, effectively breaking the cycle.
  • Figure 4: Comparative analysis of a vanilla Transformer (left), an IGA model (middle), and a VGA model (right) on the Bigram-Backcopy task [guo2024active]. (a) VGA prevents the formation of an attention sink on the <$s$> token. (b) Consequently, VGA resolves the corresponding value-state drain, preserving the norm of the sink token's value vector.
  • Figure 5: Comparison of training dynamics on the Bigram-Backcopy task. Performance and sink-token metrics are tracked over training steps for three models: the vanilla Transformer (left), the IGA model (middle), and the VGA model (right).
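
The gradient argument summarized in the captions of Figures 2 and 3 can be made concrete with a short worked derivation. It rests on an assumed form of the gate, chosen for illustration and not necessarily the paper's exact parameterization: a scalar gate $g_j = \sigma(w^\top V_j + b)$ per token, with hypothetical parameters $w$ and $b$, attention weights $a_{ij}$ that depend only on queries and keys, and output $o_i$ for query $i$.

$$
\text{Vanilla: } o_i = \sum_j a_{ij} V_j
\quad\Rightarrow\quad
\frac{\partial o_i}{\partial V_s} = a_{is}\, I
$$

$$
\text{Gated: } o_i = \sum_j a_{ij}\, g_j V_j
\quad\Rightarrow\quad
\frac{\partial o_i}{\partial V_s} = a_{is}\left[\, g_s I + g_s (1 - g_s)\, V_s w^\top \right]
$$

In the vanilla case, the gradient reaching $V_s$ scales directly with the attention weight $a_{is}$, so a sink token that attracts high attention also receives the strongest pressure on its value state. In the gated case under this assumption, both terms in the bracket carry a factor of $g_s$, so the gradient to $V_s$ vanishes as $g_s \to 0$ regardless of how large $a_{is}$ is, which matches the behavior described for Figure 3.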