Information Entropy Invariance: Enhancing Length Extrapolation in Attention Mechanisms

Kewei Li, Yanwen Kong, Yiping Xu, Jianlin Su, Lan Huang, Ruochi Zhang, Fengfeng Zhou

TL;DR

This work tackles the challenge of extending effective context length in attention mechanisms by introducing information-entropy invariance as a guiding principle. It proposes two scale temperatures: InfoScale, a training-free adjustment for dot-product attention that preserves entropy across longer sequences, and CosScale, a theory-backed scaling for cosine attention that concentrates angular attention and can emulate windowed attention patterns as scale grows. Theoretical analyses yield explicit forms and theorems relating CosScale to attention locality, while extensive experiments on GAU-α with long context demonstrate that combining InfoScale and CosScale yields state-of-the-art length extrapolation, outperforming RoPE-based, bias, windowed, and skip-training baselines. The results highlight attention score dilution as a key hurdle for long-range context and show that entropy-preserving and angle-focused scaling approaches can substantially improve practical long-context modeling, with code and data available online.

Abstract

Since research on improving the length extrapolation capabilities of large language models emerged in 2021, some studies have modified the scaling factor in the scaled dot-product attention mechanism as part of their proposed methods, without rigorous theoretical justification. To fill this gap, we propose two new scaled temperatures based on information entropy invariance to enhance length extrapolation. First, a training-free method, InfoScale, is designed for dot-product attention; it preserves focus on the original tokens during length extrapolation by keeping the entropy consistent. Second, we theoretically analyze the impact of scaling (CosScale) on cosine attention. Experimental results demonstrate that combining InfoScale and CosScale achieves state-of-the-art performance on the GAU-α model with a context window extended to 64 times the training length, and outperforms seven existing methods. Our analysis reveals that significantly increasing CosScale approximates Windowed Attention, and highlights attention score dilution as a key challenge in long-range context handling. The code and data are available at https://github.com/HT-NEKO/Information-Entropy-Invariance.
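To make the notion of attention score dilution concrete, the following is a minimal sketch, not the paper's InfoScale formula (which this summary does not state). It measures the entropy of a softmax attention row over random keys as the sequence grows, with and without an assumed log-length temperature `log(n) / log(n_train)`. The function names, the Gaussian queries and keys, and `n_train=512` are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def attention_entropy(n, n_train=512, d=64, rescale=False, seed=0):
    # One random query attending over n random keys (an assumption for
    # illustration; real queries and keys are learned, not Gaussian).
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(d)
    k = rng.standard_normal((n, d))
    # Assumed log-length temperature, used as a stand-in for an
    # entropy-motivated scale; it is NOT claimed to be InfoScale.
    scale = np.log(n) / np.log(n_train) if rescale else 1.0
    scores = scale * (k @ q) / np.sqrt(d)
    return entropy(softmax(scores))

if __name__ == "__main__":
    for n in (512, 2048, 8192, 32768):
        plain = attention_entropy(n)
        scaled = attention_entropy(n, rescale=True)
        print(f"n={n:6d}  entropy={plain:.3f}  rescaled={scaled:.3f}")
```

In this toy setting the unscaled entropy keeps rising with `n`, meaning the attention mass spreads over more tokens, while the rescaled variant rises more slowly. A length-dependent temperature can only lower the entropy relative to the unscaled softmax on the same scores; whether it holds the entropy exactly constant depends on the score distribution, which is the question the paper's analysis addresses.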

Paper Structure (39 sections, 2 theorems, 44 equations, 7 figures, 10 tables)

This paper contains 39 sections, 2 theorems, 44 equations, 7 figures, 10 tables.

Key Result

Theorem 3.4

The peak value $\eta_{1}^{*}$ of the $QK$ distribution before RoPE shifts toward 1 as CosScale $\alpha$ increases.
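As a loose illustration of how a cosine-attention temperature concentrates attention, the sketch below computes `softmax(alpha * cos(q, k_j))` over random keys for several values of `alpha`. It does not reproduce Theorem 3.4, which concerns the QK distribution of a trained model. The function names and the random inputs are assumptions made for this example only.

```python
import numpy as np

def cosine_attention_weights(q, K, alpha):
    # Cosine similarity between one query and every key, scaled by alpha.
    q_unit = q / np.linalg.norm(q)
    K_unit = K / np.linalg.norm(K, axis=1, keepdims=True)
    scores = alpha * (K_unit @ q_unit)
    # Stable softmax over the keys.
    scores = scores - scores.max()
    w = np.exp(scores)
    return w / w.sum()

def entropy(p):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Random query and keys (illustrative assumption, not trained weights).
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((1024, 64))

alphas = [8, 16, 32, 64, 128]
weights = [cosine_attention_weights(q, K, a) for a in alphas]
entropies = [entropy(w) for w in weights]
top_weights = [float(w.max()) for w in weights]

if __name__ == "__main__":
    for a, h, t in zip(alphas, entropies, top_weights):
        print(f"alpha={a:4d}  entropy={h:.3f}  max weight={t:.3f}")
```

Because cosine scores are bounded in [-1, 1], `alpha` alone sets how sharply the softmax separates high-cosine keys from the rest: as `alpha` grows, the entropy falls and the largest weight rises.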

Figures (7)

  • Figure 1: Visualization of spectral clustering, with 10 stratified samples as the test set. (a) Distribution of 2D PCA features for different samples. (b) Clustering results. (c) Sample distribution obtained through stratified random sampling.
  • Figure 2: Training loss of the base GAU-α model with and without CosScale=128. The x-axis represents the epoch.
  • Figure 3: Training loss of all the baseline models integrating InfoScale. The left panel shows the baselines with CosScale, while the right panel shows the baselines without CosScale. All baseline models requiring fine-tuning were fine-tuned for 1000 steps (equal to 100 epochs). The x-axis represents the training steps.
  • Figure 4: Comparison of theoretical $\eta_{1}^{*}$ and experimental $\eta_{1}^{*}$ at different $\alpha$ values according to Eq. \ref{eq18}. The x-axis represents $\alpha$ (CosScale), and the y-axis represents the peak value $\eta_{1}^{*}$.
  • Figure 5: Heatmaps of QK products before and after RoPE processing, normalized by the global maximum and minimum values (range 0 to 1), with the sequence length extended to 1024. The first row shows the QK products before RoPE and the second row shows the corresponding QK products after RoPE, both at increasing CosScale values (from left to right: 8, 16, 32, 64, 96, 128, 256). The two heatmaps in each column share a color bar.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Theorem 3.4
  • Theorem 3.5