Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings

Chunsheng Zuo, Pavel Guerzhoy, Michael Guerzhoy

TL;DR

This paper shows that positional information can emerge in causal Transformers without positional encodings through an adjacency pattern in the embeddings, observable in a self-cosine-similarity matrix. The authors define an adjacency probability score and demonstrate, both empirically and theoretically, that nearby embeddings become more similar to each other even at random initialization, with the first causal attention layer producing strong position signals. Probing experiments indicate that cosine similarity is a robust indicator of position and a better predictor than embedding variance. The findings suggest an implicit, relative-position representation that persists across models, tasks, and configurations, offering insight into how Transformer architectures encode order without explicit positional encodings. These results have implications for designing and understanding decoder-only models, and for deciding whether such models need positional encodings in a given setting.

Abstract

Transformers with causal attention can solve tasks that require positional information without using positional encodings. In this work, we propose and investigate a new hypothesis about how positional information can be stored without using explicit positional encoding. We observe that nearby embeddings are more similar to each other than faraway embeddings, allowing the transformer to potentially reconstruct the positions of tokens. We show that this pattern can occur in both the trained and the randomly initialized Transformer models with causal attention and no positional encodings over a common range of hyperparameters.
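
The observation that nearby embeddings are more similar than faraway ones can be illustrated with a minimal NumPy sketch. It is not the paper's experimental setup: as a simplifying assumption, the causal attention layer is replaced by uniform causal averaging (each output is the running mean of the token embeddings up to that position), the token embeddings are i.i.d. Gaussian, and the function names and the width of 256 are hypothetical choices made for illustration.

```python
import numpy as np


def self_cosine_similarity(x: np.ndarray) -> np.ndarray:
    """Return the (T, T) matrix of cosine similarities between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


def uniform_causal_attention(x: np.ndarray) -> np.ndarray:
    """Assumed stand-in for causal attention: average each row with all earlier rows."""
    counts = np.arange(1, x.shape[0] + 1)[:, None]
    return np.cumsum(x, axis=0) / counts


rng = np.random.default_rng(0)
tokens = rng.standard_normal((22, 256))  # 22 tokens, hypothetical width 256

sim_in = self_cosine_similarity(tokens)
sim_out = self_cosine_similarity(uniform_causal_attention(tokens))

# Compare the last position with its neighbour and with the first position.
print("input  (adjacent, distant):", sim_in[20, 21], sim_in[0, 21])
print("output (adjacent, distant):", sim_out[20, 21], sim_out[0, 21])
```

Under these assumptions, the raw random embeddings show no systematic preference for neighbours, while the causally averaged outputs are far more similar to adjacent positions than to distant ones, because consecutive running means share almost all of their summands.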

Paper Structure (25 sections, 3 equations, 13 figures, 6 tables)

Figures (13)

  • Figure 1: Self-cosine-similarity matrices of randomly initialized (first row) and trained (second row) 6-layer Transformers with causal attention and no positional encodings on the task of Reversal (22). The matrices are produced using a testing sample of 22 tokens, "rev(8502251258017069)=", as input, showing results from the embeddings to the output of layer 6 left to right for the initialized and trained models. The number in the bracket represents the adjacency probability score.
  • Figure 2: Histograms of the differences between the cosine similarity of nearby tokens and that of farther ones. Images in the first and second rows are for $sim(a, b) - sim(a, c)$ and $sim(c, b) - sim(c, a)$, respectively.
  • Figure 3: Layer-wise adjacency probability scores for randomly initialized and trained models, averaged over the 4 tasks; these correspond to the values presented in Table \ref{tab:performance}.
  • Figure 4: Average layer-wise probing results for trained Causal-NoPE Transformers, reported as (a) Pearson-R and (b) Normalized Root Mean Squared Error (NRMSE), using one of the following as the input: the output vector embeddings $X$, their variance $Var_X$, and the cosine similarity between the output vector embeddings and the vector at the last position $Sim_X$.
  • Figure 5: Self-cosine-similarity matrices of randomly initialized (first row) and trained (second row) 6-layer Transformers with normal attention and learned absolute positional encodings on the task of Indexing (20). The matrices are produced using a testing sample of 20 tokens, "wherex(299517340,9)=", as input.
  • ...and 8 more figures
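
The $Sim_X$ quantity named in the Figure 4 caption (cosine similarity between each output embedding and the one at the last position) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's probing procedure: the outputs come from the same uniform causal averaging of i.i.d. Gaussian embeddings used as a toy stand-in for a causal attention layer, no probe is trained, and the Pearson-R is computed directly between $Sim_X$ and the position index. The names `sim_to_last` and `pearson_r` and the sizes are hypothetical.

```python
import numpy as np


def sim_to_last(x: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of x and its last row (Sim_X)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    return unit @ unit[-1]


def pearson_r(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


rng = np.random.default_rng(0)
seq_len, width = 20, 256  # hypothetical sizes
tokens = rng.standard_normal((seq_len, width))

# Assumed toy layer: running mean over all positions up to and including the current one.
outputs = np.cumsum(tokens, axis=0) / np.arange(1, seq_len + 1)[:, None]

positions = np.arange(seq_len)
sim_x = sim_to_last(outputs)
r_sim = pearson_r(positions, sim_x)
print(f"Pearson-R between position and Sim_X: {r_sim:.3f}")
```

In this toy setting the similarity to the last position grows steadily with the position index, which is the kind of signal a probe could read out to recover position.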