Breaking Symmetry When Training Transformers

Chunsheng Zuo; Michael Guerzhoy

TL;DR

This work probes how Transformers encode positional information without positional encodings by isolating the role of causal attention. It shows that, in the absence of causal masking, next-token predictions are permutation-invariant with respect to earlier tokens, underscoring the need for causality to capture order. Through ablation on residual connections and analysis of activation correlations on a three-digit addition task, the study suggests residuals improve convergence and may influence how positional information is distributed across the network, though definitive evidence of explicit positional storage is not established. These findings inform design choices for no-PE Transformer regimes and shed light on how order information can emerge in deep architectures.

Abstract

As we show in this paper, the prediction for output token $n+1$ of a Transformer architecture that uses neither positional encodings nor causal attention is invariant to permutations of input tokens $1, 2, \ldots, n-1$. Usually, both mechanisms are employed, and the symmetry with respect to the input tokens is broken. Recently, it has been shown that one can train Transformers without positional encodings. This must be enabled by the causal attention mechanism. In this paper, we elaborate on the argument that the causal attention mechanism must be responsible for the ability of Transformers to model input sequences in which order matters. Vertical "slices" of Transformers are all encouraged to represent the same location $k$ in the input sequence. We hypothesize that residual connections contribute to this phenomenon, and present evidence for this.
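The permutation-invariance claim can be checked numerically. The following is a minimal sketch and not the authors' code: it assumes a toy two-layer, single-head, attention-only stack with random weights, residual connections, no positional encodings, and no MLP or layer normalization. All names (`attention`, `last_token_output`, `layers`) are hypothetical. Two layers are used because with a single causal layer the last position still attends to all earlier tokens as an unordered set.

```python
import numpy as np

def attention(x, wq, wk, wv, causal):
    """Single-head self-attention over the rows of x (one row per token)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    if causal:
        # Position i may only attend to positions j <= i.
        n = x.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Numerically stable row-wise softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

def last_token_output(x, layers, causal):
    """Run the toy stack and return the embedding at the last position."""
    h = x
    for wq, wk, wv in layers:
        h = h + attention(h, wq, wk, wv, causal)  # residual connection
    return h[-1]

rng = np.random.default_rng(0)
d, n = 8, 6  # embedding size and sequence length (arbitrary choices)
layers = [tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
          for _ in range(2)]
x = rng.normal(size=(n, d))  # token embeddings, no positional encodings

# Shuffle tokens 1..n-1 and keep token n (the last row) in place.
perm = np.array([2, 0, 4, 1, 3, 5])
x_perm = x[perm]

noncausal_gap = np.abs(last_token_output(x, layers, causal=False)
                       - last_token_output(x_perm, layers, causal=False)).max()
causal_gap = np.abs(last_token_output(x, layers, causal=True)
                    - last_token_output(x_perm, layers, causal=True)).max()

print(f"non-causal gap: {noncausal_gap:.2e}")
print(f"causal gap:     {causal_gap:.2e}")
```

In this sketch the non-causal gap is zero up to floating-point rounding, while the causal gap is not, because the intermediate causal outputs depend on which tokens precede each position.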

Paper Structure (9 sections, 5 equations, 5 figures, 2 tables)

This paper contains 9 sections, 5 equations, 5 figures, 2 tables.

Introduction
Background
Attention
Residual connections
The 3-digit addition task
Next-token predictions using "non-causal attention" are invariant to input permutations
Some residual connections seem necessary for Transformers to converge
Correlations between activations
Conclusions

Figures (5)

Figure 1: "Non-causal" attention matrix (a), masked attention (b), outputs in an intermediate layer of a transformer computed using masked/causal attention
Figure 2: "Non-causal" attention (a) and causal/masked attention (b)
Figure 3: Absolute value of the correlation matrices for output embeddings from layer 1 of NoPE models with residual connections removed at blocks {} (a), {0} (b), {0,1} (c), and {0,1} with a different random initialization (d). Typical results. Note that there are more large off-diagonal and off-block-diagonal values without residual connections. More results in Figs. 4 and 5.
Figure 4: Absolute value of the correlation matrices for output embeddings from layer 1 (a), 3 (b), and 6 (c) of NoPE models with no residual connections removed.
Figure 5: Absolute value of the correlation matrices for output embeddings from layer 1 (a), 3 (b), and 6 (c) of NoPE models with residual connections removed at blocks {0,1}.
