Information Entropy Invariance: Enhancing Length Extrapolation in Attention Mechanisms
Kewei Li, Yanwen Kong, Yiping Xu, Jianlin Su, Lan Huang, Ruochi Zhang, Fengfeng Zhou
TL;DR
This work tackles the challenge of extending effective context length in attention mechanisms by introducing information-entropy invariance as a guiding principle. It proposes two scaled temperatures: InfoScale, a training-free adjustment for dot-product attention that keeps entropy consistent across longer sequences, and CosScale, a theory-backed scaling for cosine attention that concentrates angular attention and approximates windowed attention as the scale grows. Theoretical analyses yield explicit forms and theorems relating CosScale to attention locality. Experiments on GAU-α with the context window extended to 64 times the training length show that combining InfoScale and CosScale yields state-of-the-art length extrapolation, outperforming seven existing methods, including RoPE-based, bias-based, windowed, and skip-training baselines. The results highlight attention score dilution as a key hurdle for long-range context and show that entropy-preserving and angle-focused scaling can substantially improve practical long-context modeling. Code and data are available online.
Abstract
Since research on improving the length extrapolation capabilities of large language models emerged in 2021, some studies have modified the scaling factor in the scaled dot-product attention mechanism as part of their proposed methods, but without rigorous theoretical justification. To fill this gap, we propose two new scaled temperatures based on information entropy invariance to enhance length extrapolation. First, InfoScale, a training-free method designed for dot-product attention, preserves focus on the original tokens during length extrapolation by ensuring consistent entropy. Second, we theoretically analyze the impact of scaling (CosScale) on cosine attention. Experiments demonstrate that combining InfoScale and CosScale achieves state-of-the-art performance on the GAU-α model with a context window extended to 64 times the training length, and outperforms seven existing methods. Our analysis reveals that significantly increasing CosScale approximates windowed attention, and highlights attention score dilution as a key challenge in long-range context handling. The code and data are available at https://github.com/HT-NEKO/Information-Entropy-Invariance.
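To make "attention score dilution" concrete, the following minimal sketch measures the entropy of softmax attention weights as the number of keys grows. It is an illustration under stated assumptions, not the paper's method: the scores are drawn as i.i.d. standard Gaussians, the helper names (`softmax`, `entropy`, `mean_attention_entropy`) are hypothetical, and the generic `scale` argument is not the InfoScale formula, which this section does not state.

```python
import math
import random


def softmax(scores, scale=1.0):
    """Numerically stable softmax of scale * scores."""
    scaled = [s * scale for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_attention_entropy(n, scale=1.0, trials=200, seed=0):
    """Average entropy of softmax weights over n keys.

    Assumption: attention logits are i.i.d. N(0, 1) samples, a toy
    stand-in for real query-key scores.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        scores = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += entropy(softmax(scores, scale))
    return total / trials


if __name__ == "__main__":
    # Entropy rises with sequence length at a fixed scale (dilution);
    # a larger scale sharpens the distribution and lowers it again.
    for n in (64, 512, 4096):
        print(n, mean_attention_entropy(n), mean_attention_entropy(n, scale=2.0))
```

In this toy setting, entropy grows with the number of keys when the scale is held fixed, and a larger scale pulls it back down. That is the intuition behind choosing a length-dependent temperature so that entropy stays roughly constant as the context grows.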
