Breaking Symmetry When Training Transformers
Chunsheng Zuo, Michael Guerzhoy
TL;DR
This work probes how Transformers encode positional information without positional encodings by isolating the role of causal attention. It shows that, when both positional encodings and causal masking are absent, next-token predictions are permutation-invariant with respect to earlier tokens, so causal masking is needed to capture order in that setting. Through an ablation of residual connections and an analysis of activation correlations on a three-digit addition task, the study suggests that residual connections improve convergence and may influence how positional information is distributed across the network, though it does not establish definitive evidence of explicit positional storage. These findings inform design choices for Transformers trained without positional encodings and shed light on how order information can emerge in deep architectures.
Abstract
We show that, in a Transformer that uses neither positional encodings nor causal attention, the prediction for output token $n+1$ is invariant to permutations of input tokens $1, 2, \ldots, n-1$. Usually both mechanisms are employed, and this symmetry with respect to the input tokens is broken. It has recently been shown that Transformers can be trained without positional encodings; in that case, the symmetry must be broken by the causal attention mechanism. In this paper, we elaborate on the argument that causal attention must be what allows such Transformers to model input sequences in which order matters. Vertical "slices" of Transformers are all encouraged to represent the same location $k$ in the input sequence. We hypothesize that residual connections contribute to this phenomenon, and we present evidence for this hypothesis.
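The invariance claim in the abstract can be checked numerically. The sketch below is a toy illustration, not the paper's experimental setup: it assumes a two-layer, single-head attention stack with residual connections, random weights, no positional encodings, and no layer normalization or MLP blocks (both act on each position separately, so they should not change the argument). The names `attention_layer` and `last_position_output` are hypothetical helpers introduced only for this example. It reverses tokens $1, \ldots, n-1$, keeps token $n$ in place, and compares the output at the last position with and without the causal mask.

```python
import numpy as np

def attention_layer(x, wq, wk, wv, causal):
    """Single-head self-attention with a residual connection; x has shape (n, d)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    if causal:
        # Position i may only attend to positions j <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Row-wise softmax; masked entries become exactly zero.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return x + weights @ v

def last_position_output(x, layers, causal):
    """Run the stack and return the representation used to predict token n+1."""
    for wq, wk, wv in layers:
        x = attention_layer(x, wq, wk, wv, causal)
    return x[-1]

rng = np.random.default_rng(0)
n, d = 6, 8
x = rng.normal(size=(n, d))  # token embeddings; no positional encodings are added
layers = [tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
          for _ in range(2)]  # two layers, each with (wq, wk, wv)

# Reverse tokens 1..n-1 and keep token n in its place.
perm = np.concatenate([np.arange(n - 2, -1, -1), [n - 1]])
x_perm = x[perm]

invariant_without_mask = np.allclose(
    last_position_output(x, layers, causal=False),
    last_position_output(x_perm, layers, causal=False),
)
invariant_with_mask = np.allclose(
    last_position_output(x, layers, causal=True),
    last_position_output(x_perm, layers, causal=True),
)
print("invariant without causal mask:", invariant_without_mask)
print("invariant with causal mask:", invariant_with_mask)
```

For generic random weights, one would expect the unmasked check to hold and the masked check to fail. Two layers are used deliberately: in a single causal layer the last position still attends to every earlier token with no notion of order, so the mask only breaks the symmetry once intermediate positions have themselves aggregated different prefixes.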
