MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs

Andreas Opedal, Haruki Shirakami, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan

TL;DR

MathGAP introduces a formal, controllable framework for evaluating LLMs on arithmetic word problems with arbitrarily complex proofs. It generates synthetic problems whose proof trees vary in depth $D$, width $W$, and shape. Problems are grounded in a world model of logical forms, which enables automated post-order CoT traces and ground-truth reasoning. The study systematically analyzes easy-to-hard generalization across depth, width, nonlinear shapes, and sentence ordering, and finds consistent OOD generalization gaps: accuracy declines as complexity increases, nonlinear proofs are especially challenging, and performance is sensitive to problem phrasing and prompts. The results highlight both the potential and the limits of current LLM reasoning and underscore the value of MathGAP for benchmarking and debugging arithmetic reasoning in a contamination-free setting. Code and datasets are publicly available.

Abstract

Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems. This is difficult to study, as (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture how problem proofs may be arbitrarily complex in various ways. In this paper, we present a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure, enabling systematic studies on easy-to-hard generalization with respect to complexity of proof trees. Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for the most capable models. The models are also sensitive to simple changes in sentence ordering. However, they remain capable of solving some complex problems, suggesting that reasoning generalization is noisy.
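To make the notions of proof-tree depth, width, and post-order reasoning traces concrete, here is a minimal sketch of a toy proof tree. It is not MathGAP's implementation: the `ProofNode` class, the helper names, and the example sentences are hypothetical, and the sketch assumes that depth is the number of inference steps on the longest root-to-leaf path and that width is the number of leaf axioms.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProofNode:
    """Toy proof-tree node (hypothetical): leaves are axioms, inner nodes are inferred facts."""
    statement: str
    premises: List["ProofNode"] = field(default_factory=list)


def depth(node: ProofNode) -> int:
    # Assumed definition: number of edges on the longest root-to-leaf path.
    if not node.premises:
        return 0
    return 1 + max(depth(p) for p in node.premises)


def width(node: ProofNode) -> int:
    # Assumed definition: number of leaves, i.e. axioms stated in the problem body.
    if not node.premises:
        return 1
    return sum(width(p) for p in node.premises)


def post_order_trace(node: ProofNode) -> List[str]:
    # Visit all premises before the conclusion they support, so every
    # step in the trace only uses facts that were already stated.
    steps: List[str] = []
    for p in node.premises:
        steps.extend(post_order_trace(p))
    steps.append(node.statement)
    return steps


# A linear proof: one comparison step followed by one transfer step.
a = ProofNode("Alice has 3 apples.")
b = ProofNode("Bob has 2 more apples than Alice.")
step1 = ProofNode("Bob has 5 apples.", [a, b])
c = ProofNode("Bob gives 1 apple to Carol.")
root = ProofNode("Bob has 4 apples.", [step1, c])

trace = post_order_trace(root)
```

In this sketch the tree has depth 2 and width 3, and the trace lists each axiom immediately before the first step that needs it. Making the proof deeper means chaining more inference steps; making it wider means adding more axioms under a node.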

Paper Structure

This paper contains 44 sections, 2 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: We propose MathGAP, an evaluation framework for arithmetic reasoning in which LLMs are tested on problems with proofs of arbitrary complexity. This diagram shows how problems and CoT solution annotations are generated under our formalism. The complete lists of logical forms and inference rules that we consider in our experiments are given in \ref{table:lf_examples} and \ref{tab:rules}, respectively.
  • Figure 2: Answer accuracies for generalization to increasing depth and width for linear problems across models and in-context distributions. Depth is increased using inference rules involving comp and transfer. Width is increased using partwhole. See \ref{exampleprobs} for example problems.
  • Figure 3: Answer accuracies for generalization to increasing depth and width for nonlinear problems across models and in-context distributions. Depth is increased using inference rules involving comp and comp-eq. The in-distribution contexts were too large to fit in context for the Llama3 models and GPT-3.5 Turbo at some of the higher depths. \ref{sec:o1-results} shows results on o1-preview and DeepSeek-R1.
  • Figure 4: Answer accuracies for generalization to permutations across models and in-context distributions. We measure complexity as the distance of the movement of a sentence to the beginning of the problem, as compared to its position in the canonical ordering of the problem.
  • Figure 5: World model (right) of a math word problem (left). Each sentence in the problem text is represented by a logical form which consists of a predicate with property arguments. The logical forms in the body are used as axioms in the proof of the problem, as shown in \ref{fig:prooftree}. The sentences and logical forms labeled (1) through (5) are the body of the problem, (6) is the question and (7) is the answer.
  • ...and 3 more figures
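
The permutation complexity described in the Figure 4 caption can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the authors' code: the function name `move_to_front` is hypothetical, the distance is assumed to be the number of positions a sentence travels from its place in the canonical ordering, and the question is assumed to stay at the end of the problem.

```python
from typing import List, Tuple


def move_to_front(body: List[str], index: int) -> Tuple[List[str], int]:
    """Move the body sentence at `index` to the start of the problem.

    Returns the reordered body and the movement distance, assumed here to be
    the number of positions the sentence moved from its canonical position.
    """
    if not 0 <= index < len(body):
        raise IndexError("index must point at a sentence in the body")
    reordered = [body[index]] + body[:index] + body[index + 1:]
    return reordered, index


# Canonical ordering of a toy problem body (the question is kept separate).
canonical = [
    "Alice has 3 apples.",
    "Bob has 2 more apples than Alice.",
    "Bob gives 1 apple to Carol.",
]
question = "How many apples does Bob have?"

reordered, distance = move_to_front(canonical, 2)
problem_text = " ".join(reordered + [question])
```

Here the last body sentence moves two positions forward, so under the assumed measure the permuted problem has complexity 2, while moving the first sentence (index 0) leaves the canonical ordering unchanged.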