Induction Signatures Are Not Enough: A Matched-Compute Study of Load-Bearing Structure in In-Context Learning

Mohammed Sabry, Anya Belz

Abstract

Mechanism-targeted synthetic data is increasingly proposed as a way to steer pretraining toward desirable capabilities, but it remains unclear how such interventions should be evaluated. We study this question for in-context learning (ICL) under matched compute (iso-FLOPs) using Bi-Induct, a lightweight data rewrite that interleaves short directional copy snippets into a natural pretraining stream: forward-copy (induction), backward-copy (anti-induction, as a directional control), or a balanced mix. Across 0.13B-1B decoder-only models, we evaluate (i) few-shot performance on standard LM benchmarks and function-style ICL probes, (ii) head-level copy telemetry, and (iii) held-out perplexity as a guardrail. Bi-Induct reliably increases induction-head activity, but this does not translate into consistent improvements in few-shot generalization: on standard LM benchmarks, Bi-Induct is largely performance-neutral relative to natural-only training, while on function-style probes the 1B natural-only model performs best. Despite explicit backward-copy cues, anti-induction scores remain near zero across scales, revealing a strong forward/backward asymmetry. Targeted ablations show a sharper distinction: removing the top 2% induction heads per layer harms ICL more than matched random ablations, with the largest relative drop occurring in the natural-only models. This indicates that natural-only training produces more centralized, load-bearing induction circuitry, whereas Bi-Induct tends to create more distributed and redundant induction activity. Our main conclusion is that eliciting a mechanism is not the same as making it load-bearing. For data-centric foundation model design, this suggests that synthetic data interventions should be evaluated not only by signature amplification, but by whether they create causally necessary computation while preserving natural-data modeling quality.

Paper Structure

This paper contains 44 sections, 10 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Examples of copy-style snippets injected into the pretraining stream. Each snippet is a span of $L$ random non-special tokens, followed by a separator, then either the same span (induction) or the reversed span (anti-induction). Colors align repeated tokens across the two halves. The illustration uses $L{=}5$ for clarity.
  • Figure 2: ICL composite (macro) across two evaluation families: (a) standard LM benchmarks; (b) the function-probe suite of Todd et al. (2024). Each panel groups models by size (0.13B, 0.5B, 1B) and colors the bars by training regime (Baseline, Induction, Anti, Balanced); error bars show $\pm 1$ s.d. For per-task results see Appendix \ref{app:icl_performance}, Table \ref{tab:main_icl_performance}.
  • Figure 3: Layer-wise copy-head telemetry. Top row: induction scores; bottom row: anti-induction scores. For each layer we plot the best-scoring head (top 2% by score with a floor of one head per layer), averaged over three seeds, for the 0.13B, 0.5B, and 1B models. Head counts for each model are given in Table \ref{tab:model-presets}.
  • Figure 4: Layer-wise copy-head telemetry. Top row: induction scores; bottom row: anti-induction scores. For each layer we plot the best-scoring head (top 2% by score with a floor of one head per layer), averaged over six seeds, for the 0.13B model with initial mix ratios of 25%, 50%, and 100%. Head counts for each model are given in Table \ref{tab:model-presets}.
  • Figure 5: Sensitivity of the ICL composite (macro) to the number of shots across two evaluation families: (a) standard LM benchmarks (3-shot vs. 1-shot); (b) the function-probe suite of Todd et al. (2024) (10-shot vs. 1-shot). Each panel groups models by size (0.13B, 0.5B, 1B), colors the bars by training regime (Baseline, Induction, Anti, Balanced), and shows $\pm 1$ s.d. error bars.
  • ...and 3 more figures
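
The snippet layout described in the Figure 1 caption (a span of $L$ random non-special tokens, a separator, then the same or the reversed span) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `make_copy_snippet`, its parameters, the integer token ids, and the choice to sample tokens with replacement are all assumptions made here for clarity.

```python
import random


def make_copy_snippet(vocab_size, special_ids, sep_id, span_len=5,
                      direction="induction", rng=None):
    """Build one copy-style snippet as a list of token ids.

    Hypothetical helper: names, arguments, and sampling with replacement
    are assumptions, not details taken from the paper.
    """
    rng = rng or random.Random()
    # Candidate tokens exclude special tokens and the separator itself.
    candidates = [t for t in range(vocab_size)
                  if t not in special_ids and t != sep_id]
    span = [rng.choice(candidates) for _ in range(span_len)]

    if direction == "induction":
        second_half = list(span)      # forward copy: same span again
    elif direction == "anti":
        second_half = span[::-1]      # backward copy: reversed span
    else:
        raise ValueError(f"unknown direction: {direction!r}")

    return span + [sep_id] + second_half


# Usage: one forward-copy and one backward-copy snippet with L = 5.
forward = make_copy_snippet(100, {0, 1}, sep_id=2, span_len=5,
                            direction="induction", rng=random.Random(0))
backward = make_copy_snippet(100, {0, 1}, sep_id=2, span_len=5,
                             direction="anti", rng=random.Random(0))
```

Each snippet has length $2L + 1$; a balanced mix, as mentioned in the abstract, would draw the direction at random per snippet, though the exact mixing procedure is not specified in this excerpt.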