AI-rithmetic

Alex Bie, Travis Dick, Alex Kulesza, Prabhakar Raghavan, Vinod Raman, Sergei Vassilvitskii

TL;DR

This study investigates why frontier large language models struggle with basic arithmetic, focusing on two-argument integer addition. Through a large-scale empirical evaluation across models and digit lengths, the authors show that accuracy degrades as operand length increases and identify two dominant error modes, misalignment and close carry. Misalignment errors are linked to tokenization, while close carry errors behave like independent random failures. Accordingly, an addition with $n$ close carries follows a simple stochastic model: if each close carry fails independently with probability $p$, the answer is correct with probability $(1-p)^n$, which aligns with the observed error distributions. The findings imply that simply scaling models will not guarantee reliable arithmetic, and the authors advocate tool-assisted or architecture-level solutions to achieve robust, deterministic arithmetic in AI systems.
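
As a rough illustration of the $(1-p)^n$ model, the Python sketch below evaluates the correctness probability for a few values of $n$. It assumes that $p$ is the chance that any single close carry fails, independently of the others. The function name `prob_correct` and the value `p_fail = 0.02` are illustrative assumptions, not figures or code from the paper.

```python
def prob_correct(n_close_carries: int, p_fail: float) -> float:
    """Probability that all n close carries succeed, assuming each one
    fails independently with probability p_fail (a modelling assumption)."""
    if n_close_carries < 0:
        raise ValueError("n_close_carries must be non-negative")
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must lie in [0, 1]")
    return (1.0 - p_fail) ** n_close_carries


if __name__ == "__main__":
    # p_fail = 0.02 is a made-up value chosen only to show the trend.
    for n in (1, 10, 50):
        print(n, round(prob_correct(n, p_fail=0.02), 3))
```

Even a small per-carry failure rate compounds: under this model the success probability decays geometrically in the number of close carries, which is consistent with accuracy falling as the operands get longer.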

Abstract

Modern AI systems have been successfully deployed to win medals at international math competitions, assist with research workflows, and prove novel technical lemmas. However, despite their progress at advanced levels of mathematics, they remain stubbornly bad at basic arithmetic, consistently failing on the simple task of adding two numbers. We present a systematic investigation of this phenomenon. We demonstrate empirically that all frontier models suffer significantly degraded accuracy for integer addition as the number of digits increases. Furthermore, we show that most errors made by these models are highly interpretable and can be attributed to either operand misalignment or a failure to correctly carry; these two error classes explain 87.9%, 62.9%, and 92.4% of Claude Opus 4.1, GPT-5, and Gemini 2.5 Pro errors, respectively. Finally, we show that misalignment errors are frequently related to tokenization, and that carrying errors appear largely as independent random failures.

Paper Structure

This paper contains 16 sections, 1 equation, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Frontier models fail at adding long numbers. The plotted points show the accuracy of the tested models on two-argument addition problems at each length. Since model performance fluctuates significantly, we include a centered 10-point moving average curve.
  • Figure 2: Most errors can be explained. We plot the distribution of error types at each length. For a description of error types, see Section \ref{sec:exploration}. We find that close carry and misalignment errors cover 87.9% of Claude Opus 4.1's mistakes, 62.9% of GPT-5's mistakes, 92.4% of Gemini 2.5 Pro's mistakes, and at least 55.6% of every model's mistakes.
  • Figure 3: Bimodal edit distance distribution of mistakes corresponds to error classes. We plot a histogram of edit distance from the true answer for all mistakes, and color each mistake by its error classification. We observe a spike of small edit distance errors for close carry, and a spike of large edit distance errors for misalignment.
  • Figure 4: For each incorrect response made by a model, we find the left-most incorrect digit and calculate (a) the delta between it and the correct digit, and (b) the long addition column sum one position to the right (i.e., the column that would carry into the incorrect column). This figure shows how common each (digit delta, next column sum) pair is for each model. A significant fraction of the total count falls on the (delta = -1, next sum = 10) and (delta = 1, next sum = 9) positions, which exactly characterize close carry mistakes.
  • Figure 5: The misalignment offset that produces the longest misaligned prefix match on examples classified as a misalignment error. Models are biased towards positive offsets (corresponding to a rightward shift of the second argument). Note the modal offset of 3 for Claude Opus 4.1 and GPT-4o, and 1 for Gemini and Gemma; we attribute this to digit tokenization (see Section \ref{sec:tokenization} for further discussion).
  • ...and 3 more figures
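
The diagnostic described in the Figure 4 caption can be sketched in Python as follows. This is a minimal reconstruction from the caption, not the authors' code. The names `column_sums` and `leftmost_error` are hypothetical, and the sketch makes three labelled assumptions: each column sum includes the incoming carry, the operands are non-negative, and responses whose length differs from the true answer are skipped.

```python
from typing import Optional, Tuple


def column_sums(a: int, b: int) -> list[int]:
    """Long-addition column sums of a + b, least significant column first.

    Assumption: each sum includes the carry coming in from the right.
    """
    sums = []
    carry = 0
    while a > 0 or b > 0:
        s = a % 10 + b % 10 + carry
        sums.append(s)
        carry = s // 10
        a //= 10
        b //= 10
    if carry:
        sums.append(carry)  # final carry forms the leading column
    return sums


def leftmost_error(a: int, b: int, response: int) -> Optional[Tuple[int, Optional[int]]]:
    """Return (digit delta, next column sum) at the left-most wrong digit.

    Returns None when the response is correct or has a different length
    than the true answer (a simplification made for this sketch).
    """
    truth, resp = str(a + b), str(response)
    if resp == truth or len(resp) != len(truth):
        return None
    i = next(k for k in range(len(truth)) if truth[k] != resp[k])
    delta = int(resp[i]) - int(truth[i])
    col = len(truth) - 1 - i  # column index counted from the right
    sums = column_sums(a, b)
    next_sum = sums[col - 1] if col >= 1 else None  # column one to the right
    return delta, next_sum


if __name__ == "__main__":
    print(leftmost_error(147, 253, 300))  # dropped carry into the hundreds
    print(leftmost_error(145, 253, 498))  # spurious carry into the hundreds
```

In the first call the true answer is 400 and the column to the right of the wrong digit sums to 10, so a dropped carry lands on (delta = -1, next sum = 10). In the second the true answer is 398 and that column sums to 9, so a spurious carry lands on (delta = 1, next sum = 9).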