THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning

Qikai Chang, Zhenrong Zhang, Pengfei Hu, Jun Du, Jiefeng Ma, Yicheng Pan, Jianshu Zhang, Quan Liu, Jianqing Gao

TL;DR

THOR tackles the core bottlenecks of tool-integrated reasoning for mathematics by introducing TIRGen for policy-aligned data generation, a hierarchical RL framework that jointly optimizes episode-level problem solving and step-level code generation, and a self-correction mechanism that leverages immediate tool feedback during inference. The approach generalizes across reasoning and non-reasoning models, achieving state-of-the-art performance on math benchmarks for models of similar scale and consistently boosting code-generation benchmarks without additional fine-tuning. By combining precise tool execution with structured RL, THOR reduces inference overhead and enhances reliability in complex symbolic and numerical tasks, with code and data to be released for reproducibility. Overall, THOR presents a practical, scalable path to robust, tool-enabled mathematical reasoning in LLMs.

Abstract

Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods face three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths, aligning with the policy and generalizing well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes for both episode-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer's correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.
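The self-correction idea in the abstract (use immediate tool feedback to revise a failing step) can be illustrated with a minimal sketch. This is not the paper's implementation: `run_tool`, `solve_step_with_correction`, `toy_propose`, and the `max_retries` parameter are hypothetical names introduced here, the policy model is replaced by a toy function, and code runs in-process via `exec` purely for illustration where a real system would presumably use a sandboxed interpreter.

```python
import contextlib
import io
from typing import Callable, List, Tuple


def run_tool(code: str) -> Tuple[bool, str]:
    """Run a Python snippet and return (succeeded, stdout or error message).

    Assumption: in-process exec stands in for a sandboxed code interpreter.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # illustration only; never exec untrusted code
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, buffer.getvalue().strip()


def solve_step_with_correction(
    propose: Callable[[str, List[str]], str],
    context: str,
    max_retries: int = 3,
) -> Tuple[bool, str, int]:
    """Request a code step; on tool failure, feed the error back and retry.

    `propose(context, errors)` is a hypothetical stand-in for the policy
    model. Returns (success, tool output or last error, attempts used).
    """
    errors: List[str] = []
    for attempt in range(1, max_retries + 1):
        code = propose(context, errors)
        ok, result = run_tool(code)
        if ok:
            return True, result, attempt
        errors.append(result)  # immediate tool feedback for the next try
    return False, errors[-1], max_retries


def toy_propose(context: str, errors: List[str]) -> str:
    """Toy policy: the first attempt is buggy, the revised one is correct."""
    return "print(1/0)" if not errors else "print(sum(range(1, 101)))"


if __name__ == "__main__":
    print(solve_step_with_correction(toy_propose, "Sum the integers 1..100"))
```

In this sketch the success flag returned by `run_tool` is also the kind of per-step signal that a step-level objective could use, while an episode-level objective would depend only on whether the final answer is correct. How THOR actually defines and combines these signals is not specified in the abstract.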

Paper Structure

This paper contains 29 sections, 8 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 2: The TIR data construction pipeline. In this pipeline, the Actor agent generates reasoning steps. The Critic agent identifies tool-executable steps and converts them into tool-augmented reasoning steps. After multi-stage filtering, we obtain the cold start dataset $\mathcal{D}_{SFT}$.
  • Figure 3: A hierarchical optimization framework comprising (a) episode-level RL for mathematical problem solving and (b) step-level optimization for code generation. In addition, we introduce (c) a self-correction mechanism for online error correction during inference.
  • Figure 4: Ablation on cold-start efficiency. We compare our TIRGen against other TIR datasets, including Long CoT from Nemotron and Short CoT from ReTool. Results are reported as code ratio in (a) and pass@16 in (b) and (c), demonstrating the effectiveness of TIRGen and cold start.
  • Figure 5: Pass@1 accuracy on code generation benchmarks.
  • Figure 6: The distribution of code call rounds in the cold start dataset $\mathcal{D}_{SFT}$.
  • ...and 7 more figures