Transformers Can Learn Rules They've Never Seen: Proof of Computation Beyond Interpolation

Andy Gray

Abstract

A central question in the LLM debate is whether transformers can infer rules absent from training, or whether apparent generalisation reduces to similarity-based interpolation over observed examples. We test a strong interpolation-only hypothesis in two controlled settings: one where interpolation is ruled out by construction and proof, and one where success requires emitting intermediate symbolic derivations rather than only final answers. In Experiment 1, we use a cellular automaton with a pure XOR transition rule and remove specific local input patterns from training; since XOR is linearly inseparable, each held-out pattern's nearest neighbours have the opposite label, so similarity-based predictors fail on the held-out region. Yet a two-layer transformer recovers the rule (best 100%; 47/60 converged runs), and circuit extraction identifies XOR computation. Performance depends on multi-step constraint propagation: without unrolling, accuracy matches output bias (63.1%), while soft unrolling reaches 96.7%. In Experiment 2, we study symbolic operator chains over integers with one operator pair held out; the model must emit intermediate steps and a final answer in a proof-like format. Across all 49 holdout pairs, the transformer exceeds every interpolation baseline (mean 41.8%, up to 78.6%; mean KRR 4.3%; KNN and MLP score 0% on every pair), while removing intermediate-step supervision degrades performance. Together with a construction showing that a standard transformer block can implement exact local Boolean rules, these results provide an existence proof that transformers can learn rule structure not directly observed in training and express it explicitly, ruling out the strongest architectural form of interpolation-only accounts: that transformers cannot in principle discover and communicate unseen rules, while leaving open when such behaviour arises in large-scale language training.

Paper Structure (37 sections, 15 theorems, 63 equations, 6 figures, 13 tables)

Key Result

Theorem 1

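Let $n=3$. Any classifier $\hat{y}(p) = \mathrm{sign}\left(\sum_{q \in T} w(d_H(p,q)) \, y(q)\right)$, where $T$ is the set of training patterns, $d_H$ is the Hamming distance, and the weights are non-negative and non-increasing in distance, $w(1) \geq w(2) \geq w(3) \geq 0$, either predicts $-y(p)$ or ties. In particular, $k$-NN majority vote fails for every odd $k \in \{1,3,5,7\}$.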
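The statement can be checked by brute force over the eight 3-bit patterns. The sketch below is illustrative only and rests on assumptions not stated in the theorem: a single pattern is held out at a time, so $T$ is the other seven patterns, and labels are encoded as $+1$ when the XOR of the three bits is 1 and $-1$ otherwise. The helper names (`label`, `weighted_vote`, `knn_vote`) are hypothetical.

```python
from itertools import product

# All 3-bit neighbourhoods (n = 3).
PATTERNS = list(product((0, 1), repeat=3))

def label(p):
    """Rule 150 output (XOR of the three bits), encoded as +1 / -1 (assumed encoding)."""
    return 1 if (p[0] ^ p[1] ^ p[2]) else -1

def hamming(p, q):
    """Hamming distance d_H between two patterns."""
    return sum(a != b for a, b in zip(p, q))

def weighted_vote(p, w):
    """Distance-weighted score for held-out p; T is assumed to be the other 7 patterns.

    w maps Hamming distance (1, 2, 3) to a non-negative weight.
    """
    return sum(w[hamming(p, q)] * label(q) for q in PATTERNS if q != p)

def knn_vote(p, k):
    """Sum of labels over the k nearest training patterns (majority vote for odd k)."""
    neighbours = sorted((q for q in PATTERNS if q != p), key=lambda q: hamming(p, q))
    return sum(label(q) for q in neighbours[:k])

if __name__ == "__main__":
    weights = {1: 1.0, 2: 0.5, 3: 0.25}  # one example with w(1) >= w(2) >= w(3) >= 0
    for p in PATTERNS:
        print(p, label(p), weighted_vote(p, weights),
              [knn_vote(p, k) for k in (1, 3, 5, 7)])
```

Under these assumptions the three neighbours at distance 1 and the one at distance 3 carry label $-y(p)$, while the three at distance 2 carry $y(p)$, so the score equals $y(p)\,\bigl(-3(w(1)-w(2)) - w(3)\bigr) \leq 0$. With the example weights this is $-1.75\,y(p)$; a tie occurs only when $w(1)=w(2)$ and $w(3)=0$.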

Figures (6)

  • Figure 1: Experimental setup. A CA evolves from a random initial state ($t=0$). At each subsequent timestep, positions where hidden patterns occur (red) receive no supervision. Visible positions (green) provide training signal. Wrong hidden-pattern predictions at $t+1$ cascade into errors at visible positions at $t+2$, providing indirect gradient signal.
  • Figure 2: Interpolation accuracy across five representation levels for Rule 150. Input space: provably 0% (Theorem 1 through Lemma 5). Neighbourhood embeddings: empirically 0%, with a fixed-position proof extension in the appendix. After transformer layers: 94--100%. The function is constructed de novo by computation.
  • Figure 3: Training dynamics for Rule D (left) and Rule 150 pattern 3 (right). Each line is one seed (10 per panel). Holdout accuracy (blue) emerges rapidly once supervised accuracy (orange) exceeds $\sim$85%. For pure XOR (Rule 150), holdout stays near 0% until this threshold; Rule D, which has internal structure permitting partial interpolation, can rise earlier. Rule 150 ($k=1$, radius 1) requires more epochs and fewer seeds converge, consistent with sparser constraints producing weaker indirect signal.
  • Figure 4: Causal substitution test for the symbolic benchmark. (a) Surgical input replacement: changing $(c,d)$ to values consistent with a different operator flips op2 while preserving op1; changing $(a,b)$ flips op1 while preserving op2. (b) Systematic results over 200 base examples per target operator confirm a clean double dissociation.
  • Figure 5: Mean holdout accuracy vs. fraction of hidden patterns for Rule D and Rule G. Rule D shows a sharp cliff at $\sim$50% hidden; Rule G degrades gradually, tolerating 88% hidden. The frontier depends on intrinsic rule complexity.
  • ...and 1 more figure
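The setup in Figure 1 can be illustrated with a few lines of code. This is a minimal sketch rather than the paper's implementation: it assumes periodic boundaries, uses Rule 150 (each cell becomes the XOR of its left neighbour, itself, and its right neighbour), and picks an arbitrary hidden pattern. All function names are hypothetical.

```python
def rule150_step(state):
    """One Rule 150 update with periodic boundaries (boundary choice is an assumption)."""
    n = len(state)
    return [state[(i - 1) % n] ^ state[i] ^ state[(i + 1) % n] for i in range(n)]

def supervision_mask(state, hidden):
    """True where the local neighbourhood is visible, False where it is a hidden pattern."""
    n = len(state)
    return [
        (state[(i - 1) % n], state[i], state[(i + 1) % n]) not in hidden
        for i in range(n)
    ]

def flip_cascade(state, i):
    """Positions at t+2 that change if the prediction for cell i at t+1 is flipped."""
    correct = rule150_step(state)
    wrong = list(correct)
    wrong[i] ^= 1  # simulate a wrong prediction at a hidden-pattern position
    after_correct = rule150_step(correct)
    after_wrong = rule150_step(wrong)
    return [j for j in range(len(state)) if after_correct[j] != after_wrong[j]]

if __name__ == "__main__":
    state = [0, 1, 1, 0, 1, 0, 0, 1]
    hidden = {(1, 1, 0)}  # illustrative choice of held-out neighbourhood
    print(rule150_step(state))
    print(supervision_mask(state, hidden))
    print(flip_cascade(state, 2))
```

Because XOR is linear over bits, a single wrong cell at $t+1$ changes exactly that cell and its two neighbours at $t+2$. Whenever any of those three positions is visible, the error reaches the supervised loss, which is the indirect signal the caption describes.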

Theorems & Definitions (33)

  • Theorem 1: Rule 150 / $n=3$: all monotonic similarity interpolation fails
  • Theorem 2: General $n$: completely monotone kernels give 0%
  • Corollary 3
  • Theorem 4: GP / kernel ridge regression with RBF kernel
  • Lemma 5: Decision trees and Random Forests
  • Proof of Theorem 1
  • Proof of Theorem 2
  • Proof of Corollary 3
  • Proof of Theorem 4
  • Proof of Lemma 5
  • ...and 23 more