
The Order Is The Message

Jordan LeDoux

Abstract

In a controlled experiment on modular arithmetic ($p = 9973$), varying only example ordering while holding all else constant, two fixed-ordering strategies achieve 99.5\% test accuracy by epochs 487 and 659 respectively from a training set comprising 0.3\% of the input space, well below established sample complexity lower bounds for this task under IID ordering. The IID baseline achieves 0.30\% after 5{,}000 epochs from identical data. An adversarially structured ordering suppresses learning entirely. The generalizing model reliably constructs a Fourier representation whose fundamental frequency is the Fourier dual of the ordering structure, encoding information present in no individual training example, with the same fundamental emerging across all seeds tested regardless of initialization or training set composition. We discuss implications for training efficiency, the reinterpretation of grokking, and the safety risks of a channel that evades all content-level auditing.
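The contrast driving the experiment is between orderings fixed once and replayed every epoch versus the IID baseline that reshuffles each epoch. A minimal sketch of that distinction, assuming the `make_orderings` helper and the small example set below are illustrative inventions (the paper's Stride and Target constructions are not specified in this excerpt and are omitted):

```python
import random

P = 9973  # modulus from the paper

def make_orderings(examples, epochs, seed=0):
    """Illustrative contrast between fixed and IID orderings.

    'Fixed-Random': one random permutation chosen once and replayed
    every epoch. 'Random' (IID baseline): a fresh permutation drawn
    independently each epoch.
    """
    rng = random.Random(seed)
    fixed = list(examples)
    rng.shuffle(fixed)  # chosen once, then reused every epoch
    fixed_random = [list(fixed) for _ in range(epochs)]

    iid = []
    for _ in range(epochs):
        epoch = list(examples)
        rng.shuffle(epoch)  # reshuffled independently every epoch
        iid.append(epoch)
    return fixed_random, iid

# A toy slice of the (a + b) mod p task, not the paper's training set.
examples = [(a, b, (a + b) % P) for a in range(10) for b in range(10)]
fixed_random, iid = make_orderings(examples, epochs=3)
```

Both schedules visit identical content; only the sequence in which gradients arrive differs, which is the sole variable the paper manipulates.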



Figures (12)

  • Figure 1: The Stride and Fixed-Random strategies generalize quickly; the Random strategy memorizes but is unable to generalize within the compute budget; the Target strategy fails to either generalize or memorize, never improving beyond chance-level performance.
  • Figure 2: Both the Stride and Fixed-Random strategies accumulate frequency power early in training, and both end training with a similar amount of power concentrated in significant frequencies. The Stride strategy organizes much earlier than the Fixed-Random strategy, but the Fixed-Random strategy achieves a higher peak concentration of frequency power.
  • Figure 3: The Embedding Spectral Entropy (blue), Decoder Spectral Entropy (orange), and Neuron Spectral Entropy (green) for all four strategies. Each measures the uniformity of the corresponding weight spectrum, with 1.0 being perfectly uniform and 0.0 being maximally concentrated.
  • Figure 4: Validation accuracy plotted against embedding, decoder, and neuron spectral entropy for the two generalizing strategies. Both follow nearly identical curves despite building different frequency bases: generalization is a function of spectral concentration itself, independent of which frequencies the model concentrates into.
  • Figure 5: L2 norm of the ordering (blue) and content (orange) gradient components across training for Fixed-Random, Random, and Stride. Under the fixed-ordering strategies, the ordering norm peaks early and then declines as the model absorbs the coherent signal. Under Random, the ordering norm remains persistently large but incoherent, at approximately $2.8\times$ the content norm throughout training.
  • ...and 7 more figures