On the Limitations and Capabilities of Position Embeddings for Length Generalization
Yang Chen, Yitao Liang, Zhouchen Lin
TL;DR
This work studies how position embeddings (PEs) influence length generalization (LG) in Transformer models. It begins with a theoretical analysis of Position-Only Linear Attention (POLA) using Linear Representation Complexity (LRC), then extends to practical Transformers via Sequential Representation Complexity (SRC) and two strategies, Scale Hint (SH) and Learning-Based Position Embeddings (LBPE), supported by empirical validation. The key findings are that PEs cannot create the new operators that longer inputs require when SRC increases, but can facilitate LG when SRC is invariant and the PRF correctly identifies operator roles; SH and LBPE broaden the applicability of LG and reduce the task-specific design burden. The results provide a theoretical boundary and practical tools for using PEs to improve LG in reasoning tasks and scalable sequence modeling.
Abstract
In Transformers, Position Embeddings (PEs) significantly influence Length Generalization (LG) performance, yet their fundamental role remains unclear. In this work, we investigate the limitations and capabilities of PEs in achieving LG. We theoretically analyze PEs in Position-Only Linear Attentions (POLAs), introducing Linear Representation Complexity (LRC) to characterize when PEs enable LG. Our analysis shows that PEs do not expand computational capabilities but structure learned computations across positions. Extending to practical Transformers, we propose Sequential Representation Complexity (SRC) and conjecture that LG is possible if and only if SRC remains invariant across scales. We support this hypothesis with empirical evidence in various reasoning tasks. To enhance LG, we introduce Scale Hint, allowing flexible instance scaling, and a Learning-Based Position Embedding framework that automatically learns positional relations. Our work provides theoretical insights and practical strategies for improving LG in Transformers.
