On Vanishing Variance in Transformer Length Generalization
Ruining Li, Gabrijel Boduljak, Jensen Zhou
TL;DR
This paper introduces a vanishing-variance perspective on Transformer length generalization: as the sequence length $N$ increases, the variance of the attention outputs decays, inducing a distribution shift that harms generalization to longer inputs. It combines theoretical reasoning with empirical studies on order-invariant tasks, showing that applying LayerNorm to the attention outputs stabilizes their global statistics and improves out-of-distribution performance, although it does not fully eliminate the decay. Ablation studies indicate that normalization, particularly LayerNorm, significantly mitigates length-related degradation; standardization alone also helps, though less so. The work suggests a path toward more robust, length-invariant architectures and highlights the need for architectural design beyond ad-hoc positional encodings to ensure reliable long-sequence reasoning in Transformers.
Abstract
It is widely known that Transformers trained on shorter sequences fail to generalize robustly to longer ones at test time. This raises the question of whether Transformer models are genuine reasoning engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a vanishing variance perspective on this issue. To the best of our knowledge, we are the first to demonstrate that, even for today's frontier models, a longer sequence length results in a lower variance in the output of the multi-head attention modules. On the argmax retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention outputs leads to significantly better length generalization. Our analyses attribute this improvement to a reduction, though not a complete elimination, of the distribution shift caused by vanishing variance.
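The variance decay described above can be illustrated with a small numerical sketch. This is not the authors' experimental setup: it assumes a single attention head with i.i.d. standard Gaussian queries, keys, and values, and the names `attention_output`, `layer_norm`, and `output_variance` are hypothetical helpers introduced only for this illustration. Under these assumptions, the variance of the attention output shrinks as the sequence length grows, while layer normalization without learnable parameters rescales the output back to roughly unit variance.

```python
import numpy as np


def attention_output(n, d, rng):
    """Single-head softmax attention for one query over n random keys/values.

    Assumption: q, k, v are i.i.d. standard Gaussian (a toy model, not trained weights).
    """
    q = rng.standard_normal(d)
    k = rng.standard_normal((n, d))
    v = rng.standard_normal((n, d))
    logits = k @ q / np.sqrt(d)          # scaled dot-product scores, shape (n,)
    w = np.exp(logits - logits.max())    # numerically stable softmax
    w /= w.sum()
    return w @ v                         # weighted average of values, shape (d,)


def layer_norm(x, eps=1e-5):
    """Layer normalization over the feature axis, without learnable scale/shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def output_variance(n, trials=200, d=64, seed=0):
    """Return (variance of raw attention outputs, variance after layer_norm)."""
    rng = np.random.default_rng(seed)
    outs = np.stack([attention_output(n, d, rng) for _ in range(trials)])
    return outs.var(), layer_norm(outs).var()


if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        raw, normed = output_variance(n)
        print(f"N={n:5d}  raw variance={raw:.4f}  after LayerNorm={normed:.4f}")
```

In this toy setting the raw variance falls steadily with $N$, whereas the normalized variance stays close to 1. The sketch only tracks the global scale of the outputs; it does not capture the residual distribution shift that, according to the abstract, layer normalization reduces but does not completely eliminate.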
