MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Andreas Opedal, Haruki Shirakami, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan
TL;DR
MathGAP is a formal, controllable framework for evaluating LLMs on arithmetic word problems with arbitrarily complex proofs. It generates synthetic problems whose proof trees vary in depth $D$, width $W$, and shape. Problems are grounded in a world model of logical forms, so ground-truth answers and chain-of-thought (CoT) traces can be produced automatically by a post-order traversal of the proof tree. The study systematically analyzes easy-to-hard generalization across depth, width, nonlinear shapes, and sentence ordering, and finds consistent out-of-distribution (OOD) generalization gaps: accuracy declines as complexity increases, nonlinear proofs are especially challenging, and performance is sensitive to problem phrasing and prompts. The results show both the potential and the limits of current LLM reasoning and underscore the value of MathGAP for benchmarking and debugging arithmetic reasoning in a contamination-free setting. Code and datasets are publicly available.
Abstract
Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems. This is difficult to study, as (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture how problem proofs may be arbitrarily complex in various ways. In this paper, we present a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure, enabling systematic studies on easy-to-hard generalization with respect to complexity of proof trees. Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for the most capable models. The models are also sensitive to simple changes in sentence ordering. However, they remain capable of solving some complex problems, suggesting that reasoning generalization is noisy.
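To make the idea of a proof tree and its reasoning trace concrete, the following is a minimal illustrative sketch, not the MathGAP implementation. It assumes that every inner node simply adds its children, that depth is the number of edges on the longest root-to-leaf path, and that width is the number of leaves. The names `Node`, `depth`, `width`, `post_order_trace`, and `build_example` are hypothetical and chosen only for this example. The actual framework works with logical forms and a richer set of inference rules.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A proof-tree node: a leaf holds a given quantity, an inner node sums its children."""
    label: str
    value: int = 0  # only meaningful for leaves
    children: List["Node"] = field(default_factory=list)


def depth(node: Node) -> int:
    """Number of edges on the longest root-to-leaf path (assumed definition)."""
    if not node.children:
        return 0
    return 1 + max(depth(child) for child in node.children)


def width(node: Node) -> int:
    """Number of leaves in the tree (assumed definition for this sketch)."""
    if not node.children:
        return 1
    return sum(width(child) for child in node.children)


def post_order_trace(node: Node, steps: List[str]) -> int:
    """Evaluate the tree bottom-up, appending one sentence per node in post-order."""
    if not node.children:
        steps.append(f"{node.label} = {node.value}.")
        return node.value
    values = [post_order_trace(child, steps) for child in node.children]
    total = sum(values)
    steps.append(f"{node.label} = {' + '.join(map(str, values))} = {total}.")
    return total


def build_example() -> Node:
    """A small nonlinear-free example: two leaves are combined, then joined with a third."""
    alice = Node("Alice's apples", 3)
    bob = Node("Bob's apples", 5)
    carol = Node("Carol's apples", 2)
    alice_bob = Node("Alice and Bob together", children=[alice, bob])
    return Node("All three together", children=[alice_bob, carol])


if __name__ == "__main__":
    root = build_example()
    steps: List[str] = []
    answer = post_order_trace(root, steps)
    print(f"depth={depth(root)}, width={width(root)}, answer={answer}")
    for step in steps:
        print(step)
```

In this toy setting, harder problems are obtained by growing the tree: adding levels raises the depth, adding leaves raises the width, and the post-order traversal still yields a step-by-step trace in which every intermediate quantity is derived before it is used.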
