On the Limitations and Capabilities of Position Embeddings for Length Generalization

Yang Chen, Yitao Liang, Zhouchen Lin

TL;DR

This work examines how position embeddings (PEs) influence length generalization (LG) in Transformer models. It begins with a theoretical analysis of Position-Only Linear Attention (POLA) based on Linear Representation Complexity (LRC), then extends to practical Transformers via Sequential Representation Complexity (SRC) and two strategies, Scale Hint (SH) and Learning-Based Position Embeddings (LBPE), with empirical validation. The key findings are that PEs cannot create the new operators required for longer inputs when SRC increases, but can facilitate LG when SRC is invariant and the PRF correctly identifies operator roles; SH and LBPE broaden the applicability of LG and reduce the burden of task-specific design. The results provide a theoretical boundary and practical tools for using PEs to improve LG in reasoning tasks and scalable sequence modeling.

Abstract

In Transformers, Position Embeddings (PEs) significantly influence Length Generalization (LG) performance, yet their fundamental role remains unclear. In this work, we investigate the limitations and capabilities of PEs in achieving LG. We theoretically analyze PEs in Position-Only Linear Attentions (POLAs), introducing Linear Representation Complexity (LRC) to characterize when PEs enable LG. Our analysis shows that PEs do not expand computational capabilities but structure learned computations across positions. Extending to practical Transformers, we propose Sequential Representation Complexity (SRC) and conjecture that LG is possible if and only if SRC remains invariant across scales. We support this hypothesis with empirical evidence in various reasoning tasks. To enhance LG, we introduce Scale Hint, allowing flexible instance scaling, and a Learning-Based Position Embedding framework that automatically learns positional relations. Our work provides theoretical insights and practical strategies for improving LG in Transformers.

Paper Structure

This paper contains 25 sections, 7 theorems, 79 equations, 6 figures, 6 tables.

Key Result

theorem 1

Define ${\mathcal{F}}_M:=\{A\in{\mathcal{U}}_N\mid \|A\|_{\infty}\leq M\}$ and ${\mathcal{F}}_{M,B}:=\{A\in {\mathcal{F}}_M\mid A_{[N_0],[N_0]}=B\}$ for all $B\in{\mathcal{U}}_{N_0}$. For any $B_0\in{\mathcal{U}}_{N_0}$ and fixed learning algorithm, let ${\mathcal{F}}_{M,B_0}^{N_0,N}\subseteq \ldots$ Then for all $M > 0$, we have ... and ...

Figures (6)

  • Figure 1: Different PEs correspond to different methods for computing outputs in the Parity (with CoT) task. IPE (see Section \ref{sec:practical-tf}) and RPE align the positions across steps and scales to compute the next token from the corresponding token in the input ($x_1$) and the current token ($x_2$). IPE encodes all other positions into a single value, whereas RPE redundantly encodes them with distinct values. In contrast, APE lacks positional alignment, requiring a distinct operator at each step. When input scales exceed those seen during training, APE necessitates novel operators not learned from data. Under the notation introduced in Section \ref{subsec:anc}, both IPE and RPE characterize a circuit of non-increasing SRC that computes the Parity task, while APE does not. As we show, PEs alone cannot introduce novel operators or handle circuits with increasing SRC. Consequently, IPE and RPE succeed in achieving LG, while APE fails to generalize.
  • Figure 2: Evaluation results of models using different PEs across six tasks. Each model is trained on 10,000 samples of scales 1--5 for 300 epochs, with evaluation performed on 1,000 samples at each scale (1--20). Checkpoints are saved every 30 epochs. For each configuration, the plotted curve corresponds to the checkpoint that achieves the best average performance across all scales.
  • Figure 3: Comparison between IPE and IPE-SH in Addition, Multiplication (1 * N), and Division (N / 1). For IPE, we align input samples to scale 20, whereas IPE-SH operates without scale alignment. Both models are trained on samples of scales 1--5 and evaluated on scales 5, 10, 15, 20 (the numbers in the subcaptions denote the evaluation scales). For clarity, we present only the evaluation results on scales 16--20.
  • Figure 4: Evaluation results of models with LBPE across three different Select tasks (SelectFirst, SelectMiddle, and SelectLast). Each model is trained on 1,000 samples of scales 1--10 for 2,000 epochs and evaluated on 1,000 samples at each scale 1--20. We save checkpoints every 20 epochs. For each configuration, we plot the curve for the checkpoint of the best average performance across all scales.
  • Figure 5: Visualization of the learned PRFs in the three Select tasks. For each task, we show the predicted PRF values corresponding to the top three prediction weights (ranked 1--3 from top to bottom) for each query position $i$ and key position $j$, where $i \leq j$. "QP" and "KP" mean "query position" and "key position", respectively.
  • ...and 1 more figure
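
The positional-alignment idea in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes a hypothetical token layout `[b_1..b_n, SEP, c_1..c_n]`, where `SEP` doubles as the initial running parity and each step computes `c_k = c_{k-1} XOR b_k` from the corresponding input token ($x_1$) and the current token ($x_2$). The names `parity_cot`, `operand_positions`, and `SEP` are illustrative assumptions. In this assumed layout, the absolute indices of the two operands change at every step, while the relative offset between them is the same at every step for a fixed scale.

```python
from typing import List, Tuple

SEP = 0  # hypothetical separator token; doubles as the initial running parity c_0 = 0


def parity_cot(bits: List[int]) -> List[int]:
    """Build the assumed sequence [b_1..b_n, SEP, c_1..c_n] with c_k = c_{k-1} XOR b_k."""
    seq = list(bits) + [SEP]
    n = len(bits)
    for k in range(1, n + 1):
        current = seq[-1]           # x_2: the current token c_{k-1}
        corresponding = seq[k - 1]  # x_1: the matching input bit b_k
        seq.append(current ^ corresponding)
    return seq


def operand_positions(n: int) -> List[Tuple[int, int, int]]:
    """For each step k, return (index of b_k, index of c_{k-1}, relative offset)."""
    rows = []
    for k in range(1, n + 1):
        input_idx = k - 1        # absolute index of b_k (0-based)
        current_idx = n + k - 1  # absolute index of c_{k-1} (SEP when k == 1)
        rows.append((input_idx, current_idx, current_idx - input_idx))
    return rows


if __name__ == "__main__":
    print(parity_cot([1, 0, 1, 1]))  # the last token is the parity of the input bits
    for row in operand_positions(4):
        print(row)  # absolute indices differ per step; the offset does not
```

In this layout the offset equals the scale `n`, so it is invariant across steps but not across scales unless inputs are aligned to a common scale. That is a property of the assumed layout, not a claim about the encoding used in the paper.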

Theorems & Definitions (21)

  • definition 1: $(N_0,N)$-Length Generalization
  • definition 2: Position-Only Linear Attention
  • definition 3: Linear Representation Complexity
  • theorem 1
  • remark 1
  • theorem 2
  • definition 4: Circuit Representation of Sequential Computation
  • example 1: Parity (with CoT)
  • example 2: Multiplication (1 * N)
  • example 3
  • ...and 11 more