Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings
Chunsheng Zuo, Pavel Guerzhoy, Michael Guerzhoy
TL;DR
This paper shows that positional information can emerge in causal Transformers without positional encodings through an adjacency pattern in the embeddings, visible in their self-cosine-similarity matrix. The authors define an adjacency probability score and demonstrate, both empirically and theoretically, that nearby embeddings become more similar even at random initialization, with the first causal attention layer already producing a strong position signal. Probing experiments indicate that cosine similarity is a robust indicator of position and a better predictor than embedding variance. The findings suggest an implicit, relative-position representation that persists across models, tasks, and configurations, offering insight into how Transformer architectures encode order without explicit positional encodings. These results have implications for designing and understanding decoder-only models, including whether they need positional encodings in a given setting.
Abstract
Transformers with causal attention can solve tasks that require positional information without using positional encodings. In this work, we propose and investigate a new hypothesis about how positional information can be stored without explicit positional encodings. We observe that nearby embeddings are more similar to each other than faraway embeddings, which could allow the Transformer to reconstruct the positions of tokens. We show that this pattern can occur in both trained and randomly initialized Transformer models with causal attention and no positional encodings, over a common range of hyperparameters.
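The adjacency pattern described above can be illustrated with a small NumPy sketch. This is not the authors' code: the sequence length, embedding width, seed, and the 0.02 initialization scale are assumptions chosen for illustration, and a plain comparison of mean similarities stands in for the paper's adjacency probability score, which is not defined in this section. The sketch passes i.i.d. random token embeddings through a single randomly initialized causal attention head with no positional encodings, then compares the cosine similarity of adjacent outputs with that of distant ones.

```python
import numpy as np


def causal_attention(x, wq, wk, wv):
    """Single-head causal self-attention with no positional encodings."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Mask out future positions (strictly above the diagonal).
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def self_cosine_similarity(h):
    """Cosine similarity between every pair of rows of h."""
    h = h / np.linalg.norm(h, axis=-1, keepdims=True)
    return h @ h.T


# Hypothetical sizes and initialization scale, chosen only for illustration.
rng = np.random.default_rng(0)
T, d = 32, 64
x = rng.standard_normal((T, d))  # i.i.d. token embeddings, no position signal
wq, wk, wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

h = causal_attention(x, wq, wk, wv)
sim = self_cosine_similarity(h)

# Group entries of the similarity matrix by distance between positions.
idx = np.arange(T)
dist = np.abs(idx[:, None] - idx[None, :])
adjacent = sim[dist == 1].mean()
far = sim[dist >= T // 2].mean()

print(f"mean cosine similarity, adjacent positions: {adjacent:.3f}")
print(f"mean cosine similarity, distant positions:  {far:.3f}")
```

One way to read the result: with small random weights the attention is close to uniform over the prefix, so each output is roughly a running average of the value vectors seen so far, and neighboring positions share most of the terms in that average. This is an informal intuition under the stated assumptions, not a restatement of the paper's theoretical argument.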
