Arithmetic OOD Failure Unfolds in Stages in Minimal GPTs

Seine A. Shintani

Abstract

Arithmetic benchmarks are often reduced to a single held-out score, but that score can conflate qualitatively different failures. We study a controlled minimal GPT trained on exhaustive 2-digit addition, where all local digit transitions are already present in training, and ask why 3-digit generalization still fails. The failure is staged. First, there is a layout barrier: a learned absolute-position model collapses under a pure 3-digit layout shift, and mixed-layout exposure is the only intervention that materially weakens this barrier. Second, after layout repair, the hundreds position behaves like a carry flag rather than a semantic hundreds digit; targeted carry probes reverse the relevant logit margin, whereas a matched extra-data control does not. Third, after carry repair, the main remaining bottleneck is conditional recomposition: high-conditioned tail data outperforms a matched control, high-only data, and tail-only data on all true-3-digit suites, and the same ordering reappears in a larger 2-layer bridge experiment. The residual errors after recomposition are then overwhelmingly tens-only, and a separate 10-seed late-stage study shows that a sign-aware tens repair raises exact match on the hardest thousands-carry suite from 0.664 to 0.822. We therefore provide an experimentally testable decomposition of arithmetic OOD failure into layout, carry-semantics, recomposition, and late tens-residual stages.
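
The premise that exhaustive 2-digit addition already contains every local digit transition can be checked by direct enumeration. The sketch below is illustrative and not taken from the paper's code: it assumes operands range over 0 to 99 with zero padding, and it treats a "local transition" as a triple (digit of a, digit of b, incoming carry). The function name `local_transitions` is hypothetical.

```python
from itertools import product

def local_transitions(max_operand=99, width=2):
    """Collect every (digit_a, digit_b, carry_in) triple seen over all a + b.

    Assumption: operands are zero-padded to `width` digits and processed
    from the units position upward, as in schoolbook addition.
    """
    seen = set()
    for a, b in product(range(max_operand + 1), repeat=2):
        carry = 0
        for pos in range(width):
            da = (a // 10**pos) % 10
            db = (b // 10**pos) % 10
            seen.add((da, db, carry))
            carry = (da + db + carry) // 10
    return seen

# All conceivable local transitions: 10 digits x 10 digits x 2 carry states.
all_triples = set(product(range(10), range(10), (0, 1)))
seen = local_transitions()
print(len(seen), seen == all_triples)
```

Under these assumptions the enumeration covers all 10 × 10 × 2 = 200 triples, so a failure on 3-digit inputs cannot be attributed to an unseen local rule. If operands were instead restricted to 10 to 99, the tens position would never see a leading zero and the coverage would be incomplete.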

Paper Structure

This paper contains 21 sections, 1 equation, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Overview of the benchmark ladder and repair program. Exhaustive 2-digit addition already covers all 200 local digit transitions, so later 3-digit failures cannot be explained by missing local rules alone. The evaluation suites are ordered so that each new suite adds one qualitatively different source of difficulty. The repair families are matched to the active barrier rather than being generic augmentation.
  • Figure 2: 10-seed late-stage confirmation: pooled exact-match means on the three true-3-digit suites. This is a downstream confirmation study in the larger 2-layer, width-32 model. tenspolarity is best on the no-incoming-carry and thousands-carry suites, while tensboundary is narrowly best on the incoming-carry suite.
  • Figure 3: 10-seed late-stage confirmation: recomposition after the upper digits are already correct. The metric is $P(\mathrm{exact}\mid\mathrm{high2\ correct})$, the exact-match rate restricted to examples whose upper two digits are already correct. Improvements in this metric show that the later gains are not explained solely by fixing the upper digits.
  • Figure 4: 10-seed late-stage confirmation: signed tens residuals. Negative values indicate systematic under-shooting of the tens digit; positive values indicate systematic over-shooting. tenspolarity moves both $c_2=0$ and $c_2=1$ branches closer to zero, most clearly on the hardest thousands-carry suite.
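
The two diagnostics named in the Figure 3 and Figure 4 captions can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: answers are treated as non-negative integers zero-padded to four digits, "high2" is taken to mean the thousands and hundreds digits, and the signed tens residual is taken to be the predicted tens digit minus the true tens digit, averaged over examples. The helper names `digits4`, `exact_given_high2`, and `mean_signed_tens_residual` are hypothetical, and the sample predictions are made up for the example.

```python
def digits4(n):
    """Zero-pad an answer to four digits: (thousands, hundreds, tens, units)."""
    return tuple(int(ch) for ch in f"{n:04d}")

def exact_given_high2(preds, targets):
    """Estimate P(exact | high2 correct).

    Assumption: 'high2' is the pair (thousands, hundreds) of the answer.
    """
    hits = total = 0
    for p, t in zip(preds, targets):
        if digits4(p)[:2] == digits4(t)[:2]:
            total += 1
            hits += (p == t)
    return hits / total if total else float("nan")

def mean_signed_tens_residual(preds, targets):
    """Mean of (predicted tens digit - true tens digit).

    Negative values mean the tens digit is under-shot on average.
    """
    res = [digits4(p)[2] - digits4(t)[2] for p, t in zip(preds, targets)]
    return sum(res) / len(res)

# Toy data, invented for illustration only.
targets = [1234, 905, 1110, 480]
preds = [1224, 905, 1210, 470]

cond_exact = exact_given_high2(preds, targets)
tens_bias = mean_signed_tens_residual(preds, targets)
print(cond_exact, tens_bias)
```

In the toy data three predictions have the correct upper two digits and one of those is fully correct, so the conditional exact match is 1/3, while the mean signed tens residual is -0.5, indicating under-shooting. Splitting the residual by carry branch, as in Figure 4, would only require grouping the examples by the relevant carry bit before averaging.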