Length Generalization of Causal Transformers without Position Encoding

Jie Wang; Tao Ji; Yuanbin Wu; Hang Yan; Tao Gui; Qi Zhang; Xuanjing Huang; Xiaoling Wang

TL;DR

This work investigates how Transformer models without explicit position encodings (NoPE) generalize to contexts longer than those seen in training. It shows that NoPE extends beyond its training length more effectively than RoPE, but still has a finite limit, which the authors link to the distraction of attention distributions. They propose a parameter-efficient head-based attention scaling (HeadScale), along with initialization and focus constraints, to stabilize and extend length generalization, and report competitive results on long-sequence language modeling, synthetic tasks, and real-world long-context benchmarks. NoPE's long-context performance still trails the strongest RoPE-based methods on some near-distance metrics, and the study highlights attention dynamics as a key lever for generalization. Overall, the work offers a new direction for long-context modeling: remove explicit position encodings and tune attention behavior, rather than relying solely on manipulating explicit position features.

Abstract

Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms that manipulate explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of attention distributions. We propose a parameter-efficient tuning method that searches for the best temperature hyper-parameter of each attention head, which substantially expands NoPE's context size. Experiments on long sequence language modeling, the synthetic passkey retrieval task, and real-world long context tasks show that NoPE can achieve performance competitive with state-of-the-art length generalization algorithms. The source code is publicly accessible.
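To make the idea of a per-head temperature concrete, here is a minimal sketch of causal attention without position encoding in which each head has its own scale factor. It assumes the scale multiplies the query-key dot products before the softmax; the function names, the toy vectors, and the scale values 0.5 and 2.0 are hypothetical and are not taken from the paper or its released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, scale):
    """Causal attention weights for one head, with no position encoding.

    Position i attends only to positions 0..i. `scale` multiplies the raw
    dot products before the softmax (an assumed form of the attention scale).
    """
    rows = []
    for i, q in enumerate(queries):
        logits = [scale * sum(a * b for a, b in zip(q, keys[j]))
                  for j in range(i + 1)]
        rows.append(softmax(logits))
    return rows


def head_scaled_attention(heads, head_scales):
    """Apply a separate scale to each head.

    heads: list of (queries, keys) pairs, one per head.
    head_scales: one float per head (hypothetical values, not tuned ones).
    """
    return [causal_attention(q, k, s) for (q, k), s in zip(heads, head_scales)]


# Toy input: 3 positions, 2-dimensional vectors, reused for both heads.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
heads = [(vectors, vectors), (vectors, vectors)]
weights = head_scaled_attention(heads, head_scales=[0.5, 2.0])

# The head with the larger scale puts more weight on its best-matching token.
print(max(weights[0][-1]), max(weights[1][-1]))
```

In this sketch a larger scale sharpens a head's attention distribution and a smaller one flattens it, which is the knob a per-head search would adjust.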

Paper Structure (28 sections, 6 equations, 12 figures, 4 tables)

Introduction
Length Generalization of NoPE
Language Modeling with NoPE
Extension? Attention!
Uniform Attention Scale
Head-based Attention Scale
Visual Analysis
Head-based Scale
Initializing HeadScale
Experiment
NoPE pre-trained model
Long Sequence Language Modeling
Settings.
Main results.
Synthetic Long Context Tasks
...and 13 more sections

Figures (12)

Figure 1: Length generalization from $2$K to $4$K. For different testing lengths (or, positions of sequences), dashed lines draw the log-perplexity of models (measured on validation set of the pre-training dataset), and solid lines represent the entropy of attention heads (averaged on all heads).
Figure 2: UniformScale modifies the temperature hyper-parameter of the $\mathrm{SoftMax}$ operator in self-attention layers (Left, NoPE; Right, RoPE). NoPE can generalize to longer context by merely scaling the softmax scores. However, this exact technique does not directly apply to RoPE models.
Figure 3: The attention entropy across all heads for the original NoPE, head-based scaled NoPE and uniform-scaled NoPE, with each model represented in a separate row. The attention heads exhibit divergent patterns.
Figure 4: Comparing uniform and head-based scale (denoted as $\lambda^{(h)}$). UniformScale fails eventually as the perplexity increases with longer sequences. HeadScale is capable of handling much longer context by assigning different scale factors to each attention head.
Figure 5: Correlation analysis for head-based scale when extended to 8K context. The analysis was conducted on the converged entropy values at 8K position, in relation to the scale searched. Each data point represents a unique attention head.
...and 7 more figures
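Several captions above refer to the entropy of attention heads. As a rough illustration, the sketch below computes the Shannon entropy of an attention distribution and averages it over heads. The function names and the example distributions are assumptions made for illustration; they are not data or code from the paper.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in weights if p > 0.0)


def mean_head_entropy(per_head_weights):
    """Average the entropy over heads at a single query position."""
    entropies = [attention_entropy(w) for w in per_head_weights]
    return sum(entropies) / len(entropies)


# Illustrative distributions only (not data from the paper).
focused = [0.97, 0.01, 0.01, 0.01]  # most weight on one token
spread = [0.25, 0.25, 0.25, 0.25]   # weight spread evenly over four tokens

print(attention_entropy(focused))
print(attention_entropy(spread))    # equals log(4) for a uniform distribution
print(mean_head_entropy([focused, spread]))
```

Low entropy means a head concentrates on a few tokens; entropy rises as the weight spreads over more of the context.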
