ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning

Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee

TL;DR

Comparisons between tool-use protocols show that improvements come less from local action selection and more from long-range plan coherence and disciplined use of observations.

Abstract

We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution. It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability. ToolMATH provides actionable diagnostic evidence of failure modes in tool-augmented agents, helping identify the control mechanisms required for robustness. ToolMATH contains roughly 8k questions and 12k tools; we provide an additional hard set, ToolMATH-Hard, with questions and tools. Our evaluation reveals that the key failure factor is an inability to reason, which causes errors in intermediate results to accumulate and constrain later decisions. Tool-list redundancy does not simply add noise; it amplifies small early deviations into irreversible execution drift. The benchmark also shows that when the intended capability is missing, distractor tools can sometimes serve as partial substitutes in solution paths, yet they can also mislead models into ungrounded tool trajectories. Finally, comparisons between tool-use protocols show that improvements come less from local action selection and more from long-range plan coherence and disciplined use of observations.

Paper Structure (112 sections, 13 equations, 15 figures, 1 table)

Figures (15)

  • Figure 1: ToolMATH construction and evaluation pipeline. Step 1: Tool extraction and validation. We convert MATH solution steps into schema-specified Python tools and retain only tools whose described semantics are consistent with their executed behavior via validation. Step 2: Tool-grounded evaluation with controlled redundancy. For each problem, we form a tool environment by combining its gold tools with distractor tools sampled from other problems.
  • Figure 2: GPT-4o-mini average accuracy across logical-hop groups (averaged over the number of distractors $k$). Curves correspond to No tools, Gold-only, and Gold + distractors at similarity levels 1-5. Accuracy degrades nearly monotonically with hop count, while higher similarity levels amplify variance in higher-hop regimes.
  • Figure 3: Failure cases under Gold-present (Level 3, $k{=}5$). Counts from $100$ failed instances per model; multiple labels per instance allowed.
  • Figure 4: ToolMATH vs. ToolMATH-Hard: hop-wise accuracy under tool availability and insufficiency (Llama 3-8B and Qwen 2.5-7B). Left: ToolMATH. Right: ToolMATH-Hard. We report accuracy by logical-hop group under No tools, Gold-only, and Distractors-only. For all settings involving distractors, we fix the distractor list to pure random sampling (Level 2) with $k=10$ tools.
  • Figure 5: Hop-wise framework comparison across models on both ToolMATH and ToolMATH-Hard under the Gold-only tool list. Across the two datasets, framework differences separate strongly at higher hops, where Plan+ReAct most reliably maintains high accuracy, highlighting long-horizon plan coherence as the key bottleneck. (Hop 1 is absent in ToolMATH-Hard.)
  • ...and 10 more figures
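
The environment construction described in the Figure 1 caption (gold tools combined with distractors sampled from other problems) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Tool` record, the `build_tool_environment` function, and the `seed` parameter are hypothetical names introduced here, and the sampling is uniformly random, which roughly corresponds to the "pure random sampling" setting mentioned in the Figure 4 caption. Similarity-controlled distractor levels are not modeled.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    # Hypothetical record; the benchmark's real tool schema is not shown in this summary.
    name: str
    problem_id: str
    description: str


def build_tool_environment(problem_id, catalog, k, seed=0):
    """Return the problem's gold tools plus up to k distractors from other problems."""
    rng = random.Random(seed)
    gold = [t for t in catalog if t.problem_id == problem_id]
    pool = [t for t in catalog if t.problem_id != problem_id]
    # Uniform sampling without replacement (an assumption of this sketch).
    distractors = rng.sample(pool, min(k, len(pool)))
    env = gold + distractors
    # Shuffle so that position in the list does not reveal which tools are gold.
    rng.shuffle(env)
    return env


catalog = [Tool(f"tool_{i}", f"problem_{i % 3}", "") for i in range(9)]
env = build_tool_environment("problem_0", catalog, k=4, seed=1)
print([t.name for t in env])
```

Passing an empty gold list (by filtering it out of the result) would give a Distractors-only environment of the kind reported in Figure 4, again under the assumptions above.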

Theorems & Definitions (1)

  • Remark 5.1