Scale-invariant Attention
Ben Anson, Xi Wang, Laurence Aitchison
TL;DR
This work addresses the problem of generalizing Transformer attention to longer contexts by introducing two desiderata: scale-invariant total attention and scale-invariant attention sparsity. Under a Gaussian model of the attention logits, it derives a simple, position-dependent logit transformation that provably satisfies both properties, and integrates it with p-RoPE. Empirically, scale-invariant attention improves long-context language modeling and zero-shot generalization to longer contexts, and it remains robust at long-context retrieval on needle-in-a-haystack tasks, outperforming several baselines. Limitations include evaluation at relatively small scale (162M and 304M parameters) and reliance on the Gaussian-logit assumption, although there are promising indications that the approach extends to larger models and other attention variants. The results suggest a practical path to better long-context processing that does not rely solely on retrieval or windowing mechanisms.
Abstract
One persistent challenge in LLM research is the development of attention mechanisms that are able to generalise from training on shorter contexts to inference on longer contexts. We propose two conditions that we expect all effective long-context attention mechanisms to satisfy: scale-invariant total attention and scale-invariant attention sparsity. Under a Gaussian assumption, we show that a simple position-dependent transformation of the attention logits is sufficient for these conditions to hold. Experimentally, we find that the resulting scale-invariant attention scheme gives considerable benefits in terms of validation loss when zero-shot generalising from training on short contexts to validation on longer contexts, and is effective at long-context retrieval.
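To make the motivation concrete, the following is a minimal NumPy sketch of why unscaled softmax attention becomes diffuse as the context grows when logits are i.i.d. standard Gaussian. It is not the transformation proposed in the paper, which is position-dependent and is not specified in this section. The helper name `mean_max_attention` and the global scale `sqrt(2 ln n)` are illustrative assumptions, chosen only to show that a length-aware rescaling of the logits can counteract the dilution in this toy setting.

```python
import numpy as np


def mean_max_attention(n, scale_logits, trials=200, seed=0):
    """Average largest softmax weight over `trials` rows of n i.i.d. N(0, 1) logits."""
    rng = np.random.default_rng(seed)
    logits = rng.standard_normal((trials, n))
    if scale_logits:
        # Assumption for illustration only: a single global, length-dependent
        # scale sqrt(2 ln n). The paper's transformation is position-dependent.
        logits = logits * np.sqrt(2.0 * np.log(n))
    # Subtract the row-wise maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return float(weights.max(axis=1).mean())


if __name__ == "__main__":
    for n in (64, 1024, 16384):
        unscaled = mean_max_attention(n, scale_logits=False)
        scaled = mean_max_attention(n, scale_logits=True)
        print(f"n={n:6d}  unscaled max weight={unscaled:.4f}  scaled max weight={scaled:.4f}")
```

In this toy setting, the largest unscaled attention weight shrinks steadily as `n` grows, so attention becomes less sparse at longer contexts, whereas the rescaled logits keep a far larger share of attention on the top token at these lengths. This only loosely illustrates the sparsity desideratum; it says nothing about total attention over position ranges.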
