Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning

Wenxun Wu, Yuanyang Li, Guhan Chen, Linyue Wang, Hongyang Chen

TL;DR

TAPO is a tool-augmented policy optimization framework that integrates multi-hop reasoning with adaptive tool calling (search and code execution) to handle tasks that need up-to-date knowledge or complex computation. Built on the on-policy Dynamic Sampling Policy Optimization (DAPO) paradigm, TAPO jointly trains reasoning and tool-use behaviors using task-specific XML-like formatting and a two-tool pipeline, which yields efficient tool utilization and mitigates reward hacking. The authors introduce two datasets, TAPO-easy-60K and TAPO-hard-18K, for training and evaluating knowledge retrieval and mathematical computation, and report state-of-the-art performance for 3B and 7B models among baselines of comparable parameter count, with strong generalization to out-of-domain tasks. The work highlights the practical potential of coupling advanced reasoning with external tools for knowledge-intensive and computation-heavy applications, and identifies GPU bottlenecks and limited generalization in smaller models as open challenges for future work.

Abstract

Recent advances in large language models (LLMs) have popularized test-time scaling, where models generate additional reasoning tokens before producing final answers. These approaches have demonstrated significant performance improvements on benchmarks involving mathematical reasoning. However, language models relying solely on direct inference still struggle with tasks demanding up-to-date knowledge or computational tools such as calculators and code interpreters for complex arithmetic operations. To overcome these limitations, we propose Tool-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework that systematically integrates multi-hop reasoning with adaptive tool-calling capabilities. Our approach employs a modified version of Dynamic Sampling Policy Optimization (DAPO), a recently developed RL paradigm, which we adapt specifically for tool invocation scenarios, enabling models to dynamically interleave complex reasoning with on-demand tool usage (including search APIs and Python interpreters). To support this research, we introduce two new datasets: TAPO-easy-60K and TAPO-hard-18K, specifically designed to train and evaluate both fact-based reasoning and mathematical calculation capabilities. Our experiments on Qwen2.5-3B and Qwen2.5-7B models demonstrate the effectiveness of our approach, with both models achieving state-of-the-art performance on tasks requiring external knowledge and mathematical computation among methods with comparable parameters. Notably, TAPO achieves more efficient tool utilization than baseline methods while preventing excessive calls caused by reward hacking. These results highlight the significant potential of combining advanced reasoning with tool usage to enhance model performance in knowledge-intensive and computationally demanding tasks.

Paper Structure

This paper contains 43 sections, 11 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: An Example of Policy Model Inference in TAPO. The policy language model generates multi-stage reasoning outputs while automatically appending tool-calling responses (e.g., from search engine or Python code interpreter) when external tools are invoked.
  • Figure 2: Overview of the TAPO pipeline. (1) The policy model first generates a batch of tool-augmented responses, which are then scored based on ground truths. (2) Responses are dynamically sampled from groups with non-zero standard deviation to ensure diversity. (3) Finally, the advantage function is computed and used to update the policy model.
  • Figure 3: Distribution of average tool invocation counts across benchmark datasets per rollout.
  • Figure 4: Search engine invocation frequency versus performance on the NQ dataset.
  • Figure 5: Reward score progression during training.
  • ...and 3 more figures
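
Steps (2) and (3) of the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `filter_groups` and `group_advantages`, the toy binary rewards, the use of the population standard deviation, and the group-normalized (mean-centered, std-scaled) form of the advantage are all assumptions made for illustration. The caption only states that groups with non-zero reward standard deviation are kept and that an advantage is then computed.

```python
from statistics import mean, pstdev


def filter_groups(groups):
    """Dynamic sampling step (assumed form): keep only prompts whose
    rollout rewards are not all identical, i.e. have non-zero std."""
    return {prompt: rewards for prompt, rewards in groups.items()
            if pstdev(rewards) > 0.0}


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage (assumed form): center each reward on
    the group mean and scale by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for three prompts, four rollouts each.
groups = {
    "q1": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: informative, kept
    "q2": [1.0, 1.0, 1.0, 1.0],  # all correct: zero std, dropped
    "q3": [0.0, 0.0, 0.0, 0.0],  # all wrong: zero std, dropped
}

kept = filter_groups(groups)
advantages = {prompt: group_advantages(r) for prompt, r in kept.items()}
```

Under these assumptions, a group in which every rollout receives the same reward would yield all-zero advantages and so contribute no gradient signal, which is one plausible reason for filtering such groups before the policy update.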