Transformers converge to invariant algorithmic cores

Joshua S. Schiffman

TL;DR

This work reveals low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures rather than implementation-specific details.

Abstract

Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation across scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants -- the computational essence -- rather than implementation-specific details.
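The notions of a "sufficient" and "necessary" subspace can be made concrete with a small sketch of the projection-based ablations that the figure captions describe ($\tilde{\mathbf{h}}=\mathbf{h}\mathbf{P}$ and $\tilde{\mathbf{h}}=\mathbf{h}-\mathbf{h}\mathbf{P}$). The function names, the random basis, and the toy activations below are illustrative assumptions, not the paper's code, and the plain reflection used for "flipping" is only one simple reading of that operation.

```python
import numpy as np

def core_projector(basis):
    """Orthogonal projector P = Q Q^T onto the column span of `basis` (d x k)."""
    Q, _ = np.linalg.qr(basis)  # orthonormalize the (hypothetical) core basis
    return Q @ Q.T

def core_only(h, P):
    """Sufficiency test: keep only the core component, h_tilde = h P."""
    return h @ P

def core_removed(h, P):
    """Necessity test: remove the core component, h_tilde = h - h P."""
    return h - h @ P

def core_flipped(h, P):
    """Assumed 'flip': reflect the core component, h_tilde = h - 2 h P."""
    return h - 2.0 * (h @ P)

# Toy stand-ins: 10 hidden states of width 64 and a random 3D subspace.
# A real core would be extracted from a trained model, not drawn at random.
rng = np.random.default_rng(0)
d_model, k = 64, 3
h = rng.normal(size=(10, d_model))
P = core_projector(rng.normal(size=(d_model, k)))

h_keep = core_only(h, P)     # feed downstream to test sufficiency
h_drop = core_removed(h, P)  # feed downstream to test necessity
h_flip = core_flipped(h, P)  # feed downstream to test sign inversion
```

Hidden states are treated as row vectors, matching the $\mathbf{h}\mathbf{P}$ convention in the captions. In an actual experiment each perturbed activation would replace the original one in the forward pass, and task accuracy would be measured on the result.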

Paper Structure (28 sections, 22 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Transformers trained on the same Markov task converge to a low-dimensional, causal algorithmic core. Three one-layer transformer language models with identical architectures ($d_{\rm model}=64$, $d_{\rm ff}=256$, $|V| = 4$) were initialized with independent random seeds and trained on the same next-token prediction task on sequences generated by a four-state Markov chain, reaching equal test accuracies (Methods). (A) Despite identical architectures and training data, learned parameters differed substantially across runs, as measured by cosine similarity. (B) From each model's 64D hidden state, a 3D algorithmic core was extracted and its test accuracy assessed under ablations: baseline (control; using full activations $\tilde{\mathbf{h}}=\mathbf{h}$), core-only (core$^+$: $\tilde{\mathbf{h}}=\mathbf{h}\mathbf{P}$, using activations projected onto the core subspace) to evaluate sufficiency, and core-removed (core$^-$: $\tilde{\mathbf{h}}=\mathbf{h}-\mathbf{h}\mathbf{P}$) to evaluate necessity (\ref{tab:mark_ablations}). Ablation performance is compared to the Bayes-optimal one-step accuracy $\sum_i \pi_i \max_j T_{ij}$ and the unconditional chance baseline $\max (\mathbold{\pi})$, where $\mathbf{T}$ is the Markov transition probability matrix and $\mathbold{\pi}$ is its stationary distribution. (C) Although all cores have the same rank and each appears to be necessary (core$^- \approx$ chance) and sufficient (core$^+ \approx$ optimal), their geometric alignment is weak: normalized projector overlap $\mathrm{tr}(\mathbf{P}_i\mathbf{P}_j)/\mathrm{tr}(\mathbf{P}_i)$ is low and the subspaces are nearly orthogonal, with principal angles between $75^\circ$ and $90^\circ$ (\ref{tab:mark_cca}). (D) In contrast, cores exhibit strong statistical similarity: canonical correlation analysis (CCA) yields near-unity mean canonical correlations across core dimensions (see also \ref{tab:mark_cca}). (E) After mapping each core into a shared "canonical" coordinate system (rank $=3$), core ablations remain necessary and sufficient; by comparison, full "consensus" activation alignments yield subspaces (rank $=48$) that are sufficient (keep $\approx$ baseline) but not necessary (remove $\gg$ chance). (F) Linear dynamics fit in core coordinates recover the Markov chain's non-trivial spectrum: the inferred eigenvalues closely match the three eigenvalues of $\mathbf{T}$ that remain after excluding the Perron--Frobenius eigenvalue, suggesting the core routes the learned task dynamics (see also \ref{tab:mark_spec}). Points in B and E represent individual test accuracies and error bars denote mean $\pm$ s.e.m.
  • Figure 2: Modular addition cores form at grokking and are defined by automatically recoverable rotational operations. Three two-layer transformers with identical architectures ($d_{\rm model} = 128$, $d_{\rm ff} = 512$) were initialized with independent random seeds and trained for $2 \times 10^3$ epochs on the same modular addition task, $a+b \equiv c \pmod{p}$, with $a,b \in \{0,\dots, p-1\}$ and $p=53$. (A) Transformer test accuracy (red, mean $\pm$ s.e.m. on left y-axis) vs. training time (epochs) exhibits grokking, whereby test accuracy spikes long after training accuracy (not shown). Grokking coincides with the formation of a modular addition algorithmic core, which compresses in size (gray, mean $\pm$ s.e.m. on right y-axis) prior to grokking. (B) Low-dimensional algorithmic cores from each transformer appear necessary and sufficient under projection-based ablations after grokking: test accuracy stays at baseline when only the core is kept (blue) and drops to near chance when the core is removed (orange). (C) Automated operator fits at selected training epochs reveal the emergence of a cyclic computational mechanism. Early in training (epochs 0--300), eigenvalues scatter inside the unit circle: the learned transformation appears contractive, not cyclic. At grokking (epoch 800), eigenvalues snap onto the unit circle, indicating the discovery of a cyclic or rotational mechanism. The quality of fit ($R^2_h$) jumps from near-zero to near-unity, suggesting the core has formed into a coherent, compact algorithm.
  • Figure 3: Extended training under weight decay "over-educates" transformers: cores inflate and operators saturate. Long-term training dynamics of transformers that grokked modular addition, under different weight decay (WD) schedules. (A) Although grokking coincides with the emergence of a low-dimensional causal core subspace near epoch 800 (\ref{fig:mod1}), under continued training the core subspace dimension increases when weight decay is maintained (black; mean $\pm$ s.e.m. of three transformers). In contrast, when weight decay is disabled after grokking (set from $1$ to $0$), the same transformers (branched from a post-grokking checkpoint) do not exhibit core inflation (purple). (B) Core inflation appears to be driven by redundancy: the number of ranked core subspace dimensions required to maintain test accuracy remains stable (blue), whereas the number of dimensions that must be removed to drop the model to near chance accuracy increases (orange). Lines depict mean values across models trained with weight decay fixed at $1$ for all epochs. (C) (Left) Linear dynamics fit at the terminal epoch ($2 \times 10^4$) reveal a saturated core operator when weight decay is maintained throughout training, in contrast to a more sparsely represented operator when weight decay is removed. (Right) The number of rotational modes (conjugate eigenvalue pairs) around the unit circle increases with extended training under weight decay, whereas mode counts remain stable when weight decay is removed.
  • Figure 4: Subject--verb agreement is supported by a shared 1D core across GPT-2 model scales. The core framework was applied to GPT-2 Small (117M parameters; 12 layers), Medium (345M parameters; 24 layers), and Large (774M parameters; 36 layers) to isolate a low-dimensional mechanism facilitating number agreement (grammatically correct choice of is/was vs. are/were). (A) Layer sweep: agreement performance (AUC) as a function of normalized layer depth, averaged across LLMs (lines) with per-model measurements overlaid (markers) and shaded min--max bands. Agreement performance is the probability that the model assigns a higher plural-vs-singular verb-preference score (logit margin) to a plural prompt than to a singular prompt ($1.0 =$ perfect, $0.5 =$ chance, $0.0 =$ inverted). (B) Projecting last-token hidden states onto the core produces a nearly linear control axis for the singular--plural logit margin (i.e., verb-preference score); per-model affine fits are shown after $z$-scoring both axes (legend reports per-model $R^2$). (C) Perturbations at a selected layer for each model: removing the core degrades number agreement, while flipping the core inverts LLM verb preference. Box plots show the distribution of prompt-level agreement scores under perturbations (x-axis) for each GPT-2 model. Reported p-values combine per-model paired Wilcoxon tests using Fisher's method; all points shown; center lines = median; box = IQR (25th--75th percentiles); whiskers = 1.5$\,\times\,$IQR.
  • Figure 5: Steering the core induces systematic agreement violations in open-ended text generation. Each panel shows text generated by GPT-2 from the same prompt under two conditions: Base (unmodified model) and Core Steering (activations adaptively reflected through the 1D agreement core at each token). Colored words highlight selected agreement violations. Core steering reliably inverts number preferences: singular subjects acquire plural verbs and contexts expecting plural forms shift toward singular. Effects generalize across verb types, syntactic positions, and model scales.
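The reference quantities named in the Figure 1 caption can be computed directly from a transition matrix and a pair of subspace bases. The sketch below is a minimal NumPy illustration: the 4-state matrix `T`, the function names, and the orthonormal-basis inputs are hypothetical choices for demonstration, not values or code from the paper.

```python
import numpy as np

# Hypothetical row-stochastic transition matrix for a four-state Markov chain.
T = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.2, 0.5, 0.2],
    [0.1, 0.1, 0.3, 0.5],
])

def stationary_distribution(T):
    """Left eigenvector of T for the eigenvalue closest to 1, summing to 1."""
    vals, vecs = np.linalg.eig(T.T)
    idx = np.argmin(np.abs(vals - 1.0))
    pi = np.real(vecs[:, idx])
    return pi / pi.sum()

def bayes_optimal_accuracy(T, pi):
    """One-step Bayes-optimal accuracy: sum_i pi_i * max_j T_ij."""
    return float(np.sum(pi * T.max(axis=1)))

def chance_baseline(pi):
    """Unconditional chance baseline: always guess the most frequent state."""
    return float(pi.max())

def nontrivial_spectrum(T):
    """Eigenvalues of T with the Perron-Frobenius eigenvalue (1) removed."""
    vals = np.linalg.eigvals(T)
    return np.delete(vals, np.argmin(np.abs(vals - 1.0)))

def principal_angles_deg(Qi, Qj):
    """Principal angles (degrees) between subspaces with orthonormal bases."""
    s = np.linalg.svd(Qi.T @ Qj, compute_uv=False)
    return np.degrees(np.arccos(np.clip(s, 0.0, 1.0)))

def projector_overlap(Qi, Qj):
    """Normalized projector overlap tr(P_i P_j) / tr(P_i)."""
    Pi, Pj = Qi @ Qi.T, Qj @ Qj.T
    return float(np.trace(Pi @ Pj) / np.trace(Pi))

pi = stationary_distribution(T)
print("Bayes-optimal accuracy:", bayes_optimal_accuracy(T, pi))
print("Chance baseline:", chance_baseline(pi))
print("Non-trivial eigenvalues:", nontrivial_spectrum(T))
```

The Bayes-optimal accuracy can never fall below the chance baseline, since $\sum_i \pi_i \max_j T_{ij} \ge \max_j \sum_i \pi_i T_{ij} = \max_j \pi_j$. For two cores with orthonormal bases `Qi` and `Qj` (each of shape $d \times k$), `principal_angles_deg` returns $k$ angles; values near $90^\circ$ together with a small `projector_overlap` correspond to the weak geometric alignment described in panel C.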