AI-rithmetic
Alex Bie, Travis Dick, Alex Kulesza, Prabhakar Raghavan, Vinod Raman, Sergei Vassilvitskii
TL;DR
This study investigates why frontier large language models struggle with basic arithmetic, focusing on two-argument integer addition. Through a large-scale empirical evaluation across models and digit lengths, the authors show that accuracy degrades as operand length increases and identify two dominant error modes: misalignment and close carry. Misalignment errors are frequently linked to tokenization, while carry errors behave largely like independent random failures. Under this independence assumption, an addition with $n$ close carries is answered correctly with probability $(1-p)^n$, where $p$ is the per-carry failure probability, which aligns with the observed error distributions. The findings imply that simply scaling models will not guarantee reliable arithmetic, and the authors advocate for tool-assisted or architecture-level solutions to achieve robust, deterministic arithmetic in AI systems.
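The $(1-p)^n$ expression can be illustrated with a minimal sketch. It assumes that each of the $n$ carries fails independently with the same probability $p$ and that a single failure makes the whole sum wrong. The function names and the numeric values of `p` are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
import random


def predicted_accuracy(p: float, n: int) -> float:
    """Closed form: probability that all n independent carries succeed."""
    return (1.0 - p) ** n


def simulated_accuracy(p: float, n: int, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (illustrative assumption:
    each carry fails independently with probability p)."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        # The sum counts as correct only if every one of the n carries succeeds.
        if all(rng.random() >= p for _ in range(n)):
            correct += 1
    return correct / trials


if __name__ == "__main__":
    p = 0.01  # hypothetical per-carry failure probability
    for n in (1, 10, 50):
        print(n, predicted_accuracy(p, n), simulated_accuracy(p, n))
```

Even a small per-carry failure rate compounds under this model: with the hypothetical value $p = 0.01$, an addition with 50 such carries is predicted to be correct only about 60.5% of the time.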
Abstract
Modern AI systems have been successfully deployed to win medals at international math competitions, assist with research workflows, and prove novel technical lemmas. However, despite their progress at advanced levels of mathematics, they remain stubbornly bad at basic arithmetic, consistently failing on the simple task of adding two numbers. We present a systematic investigation of this phenomenon. We demonstrate empirically that all frontier models suffer significantly degraded accuracy for integer addition as the number of digits increases. Furthermore, we show that most errors made by these models are highly interpretable and can be attributed to either operand misalignment or a failure to correctly carry; these two error classes explain 87.9%, 62.9%, and 92.4% of Claude Opus 4.1, GPT-5, and Gemini 2.5 Pro errors, respectively. Finally, we show that misalignment errors are frequently related to tokenization, and that carrying errors appear largely as independent random failures.
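To make the carry-related error class concrete, the following sketch counts how many column positions produce a carry when two non-negative integers are added by the grade-school algorithm. It is only an illustration: the helper `count_carries` is hypothetical, and the paper's own criteria for classifying an error as a carry failure or as a misalignment are not reproduced here.

```python
def count_carries(a: int, b: int) -> int:
    """Count the columns that generate a carry in the base-10 sum a + b."""
    if a < 0 or b < 0:
        raise ValueError("operands must be non-negative")
    carries = 0
    carry = 0
    while a > 0 or b > 0:
        # Add the current digits plus any carry from the previous column.
        column = a % 10 + b % 10 + carry
        carry = 1 if column >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries


if __name__ == "__main__":
    for a, b in [(123, 456), (58, 67), (999, 1)]:
        print(a, b, a + b, count_carries(a, b))
```

A count like this could be used to group test problems by the number of carries they require, so that accuracy can be compared across groups.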
