Breaking Symmetry When Training Transformers

Chunsheng Zuo, Michael Guerzhoy

TL;DR

This work probes how Transformers encode positional information without positional encodings by isolating the role of causal attention. It shows that, in the absence of causal masking, next-token predictions are permutation-invariant with respect to earlier tokens, underscoring the need for causality to capture order. Through ablation on residual connections and analysis of activation correlations on a three-digit addition task, the study suggests residuals improve convergence and may influence how positional information is distributed across the network, though definitive evidence of explicit positional storage is not established. These findings inform design choices for no-PE Transformer regimes and shed light on how order information can emerge in deep architectures.

Abstract

As we show in this paper, the prediction for output token $n+1$ of Transformer architectures that use neither positional encodings nor causal attention is invariant to permutations of input tokens $1, 2, \ldots, n-1$. Usually, both mechanisms are employed, and the symmetry with respect to the input tokens is broken. Recently, it has been shown that one can train Transformers without positional encodings. This must be enabled by the causal attention mechanism. In this paper, we elaborate on the argument that the causal attention mechanism must be responsible for the fact that Transformers are able to model input sequences where the order is important. Vertical "slices" of Transformers are all encouraged to represent the same location $k$ in the input sequence. We hypothesize that residual connections contribute to this phenomenon, and demonstrate evidence for this.
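
The symmetry argument in the abstract can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a toy two-layer, single-head attention stack with residual connections, random weights, and no positional encodings (the names `attention_layer` and `last_position_output` are hypothetical). It permutes tokens $1, \ldots, n-1$ while keeping token $n$ fixed, then compares the output at position $n$ with and without the causal mask. Two layers are used because a single causal layer still sees the earlier tokens as an unordered set at the last position.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(x, wq, wk, wv, causal):
    """Single-head self-attention with a residual connection and no positional encodings."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    if causal:
        # Position i may attend only to positions j <= i.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return x + softmax(scores) @ v

def last_position_output(x, layers, causal):
    """Run the toy stack and return the representation used to predict token n+1."""
    for wq, wk, wv in layers:
        x = attention_layer(x, wq, wk, wv, causal)
    return x[-1]

rng = np.random.default_rng(0)
n, d = 6, 8                                  # toy sequence length and embedding width
x = rng.normal(size=(n, d))                  # token embeddings, no positional signal added
layers = [tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
          for _ in range(2)]                 # (wq, wk, wv) for each of two layers

# Reverse tokens 1..n-1 and keep the last token in place.
perm = list(range(n - 1))[::-1] + [n - 1]
x_perm = x[perm]

diff_full = np.abs(last_position_output(x, layers, causal=False)
                   - last_position_output(x_perm, layers, causal=False)).max()
diff_causal = np.abs(last_position_output(x, layers, causal=True)
                     - last_position_output(x_perm, layers, causal=True)).max()

print("non-causal difference:", diff_full)   # zero up to floating-point error
print("causal difference:", diff_causal)     # clearly nonzero
```

Position-wise components such as layer normalization and MLP blocks are omitted here; they act on each position independently, so they would not change which of the two cases is symmetric.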

Paper Structure

This paper contains 9 sections, 5 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: "Non-causal" attention matrix (a), masked attention (b), outputs in an intermediate layer of a transformer computed using masked/causal attention
  • Figure 2: "Non-causal" attention (a) and causal/masked attention (b)
  • Figure 3: Absolute value of the correlation matrices for output embeddings from layer 1 of NoPE models with residual connections removed at blocks {} (a), {0} (b), {0,1} (c), and {0,1} with a different random initialization (d). Typical results are shown. Note that more off-diagonal and off-block-diagonal values are large when residual connections are removed. More results in Figs. 4 and 5.
  • Figure 4: Absolute value of the correlation matrices for output embeddings from layer 1 (a), 3 (b), and 6 (c) of NoPE models with no residual connections removed.
  • Figure 5: Absolute value of the correlation matrices for output embeddings from layer 1 (a), 3 (b), and 6 (c) of NoPE models with residual connections removed at blocks {0,1}.
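
The captions above refer to the absolute value of correlation matrices computed from output embeddings. The listing does not specify how the paper lays out those matrices, so the following is only a sketch under stated assumptions: activations are taken to have shape (samples, positions, features), every (position, feature) pair is treated as one variable, and the data is a synthetic stand-in rather than anything from the paper. The helper name `abs_correlation_matrix` is hypothetical.

```python
import numpy as np

def abs_correlation_matrix(acts):
    """Absolute Pearson correlation between all (position, feature) variables.

    acts: array of shape (num_samples, seq_len, d_model), an assumed layout.
    Returns an array of shape (seq_len * d_model, seq_len * d_model).
    """
    num_samples = acts.shape[0]
    flat = acts.reshape(num_samples, -1)     # one column per (position, feature) pair
    return np.abs(np.corrcoef(flat, rowvar=False))

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 4, 3))          # synthetic stand-in, not data from the paper
corr = abs_correlation_matrix(acts)
print(corr.shape)                            # (12, 12)
```

With this layout, consecutive groups of `d_model` rows and columns correspond to one sequence position, so large values outside the diagonal blocks would indicate correlation between activations at different positions.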