Adaptive Tool Generation with Models as Tools and Reinforcement Learning

Chenpeng Wang, Xiaojie Cheng, Chunye Wang, Linfeng Yang, Lei Zhang

TL;DR

The paper tackles the scalability and reliability limitations of tool-augmented LLMs that rely on live APIs by introducing Model-as-Tools Reasoning (MTR), a simulation-first framework in which three specialized agents generate tool interfaces and simulate tool observations. It decouples structural learning from strategic optimization through a two-stage training pipeline: Stage 1 uses supervised fine-tuning on complete ReAct traces to learn trace grammar, and Stage 2 uses Group Relative Policy Optimization to refine tool-use strategy via a composite reward that balances answer correctness and internal consistency. Empirical results across four multi-hop QA benchmarks show that MTR achieves competitive exact-match performance without API dependencies and excels on reasoning-intensive tasks, notably outperforming baselines on Bamboogle. The results suggest that rich tool reasoning can be learned from structured, simulated traces, offering scalable, stable training for tool-based QA without live API access, with potential for hybrid deployments in the future.

Abstract

Tool-augmented language models have demonstrated strong capabilities, but their reliance on live API access creates scalability and reliability challenges during training and deployment. We propose MTR, a simulation-first training framework for tool-augmented reasoning. Instead of relying on live APIs, MTR learns from complete ReAct traces with schema-validated, simulated observations. Our approach operates through a multi-agent architecture where a ToolMaker generates task-specific, OpenAI-compatible tool interfaces, an AutoAgent produces structured think-act-observe sequences, and a ToolActor simulates realistic responses. Training proceeds in two stages: Stage-1 Supervised Fine-Tuning (SFT) teaches 'trace grammar' from complete reasoning sequences; Stage-2 Group Relative Policy Optimization (GRPO) optimizes strategy with a composite trace reward that balances answer correctness and internal consistency. Across four multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA, Bamboogle), MTR attains Exact Match (EM) scores competitive with live-API systems and excels on reasoning-intensive tasks, suggesting that effective tool reasoning can be learned from structured traces without live interactions.
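
To make the Stage-2 objective more concrete, the following is a minimal sketch of how a composite trace reward and the group-relative advantages used by GRPO might be computed. It is not the authors' implementation: the reward weights, the `trace_consistency` heuristic, and the step dictionary layout are hypothetical choices made for illustration, and the answer normalization follows the SQuAD-style convention commonly used for Exact Match rather than anything stated in the abstract.

```python
import re
import string
from statistics import mean, pstdev


def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def trace_consistency(trace: list[dict]) -> float:
    """Hypothetical consistency score: the fraction of 'act' steps that are
    immediately followed by an 'observe' step."""
    acts = [i for i, step in enumerate(trace) if step["type"] == "act"]
    if not acts:
        return 0.0
    paired = sum(
        1 for i in acts
        if i + 1 < len(trace) and trace[i + 1]["type"] == "observe"
    )
    return paired / len(acts)


def composite_reward(prediction, gold, trace, w_answer=0.7, w_consistency=0.3):
    """Weighted sum of correctness and consistency (weights are assumptions)."""
    return (w_answer * exact_match(prediction, gold)
            + w_consistency * trace_consistency(trace))


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its group of
    responses sampled for the same question. Population std is used here;
    implementations differ on this detail."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group of sampled traces answering the same question, one would score each trace with `composite_reward` and pass the resulting list to `group_relative_advantages`. Traces that score above the group mean receive positive advantages, so no separate value model is needed.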

Paper Structure

This paper contains 28 sections, 3 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: MTR Framework Components. (a) The ToolMaker operates through a systematic pipeline that combines predefined utility tools with task-specific tool generation, including task classification, interface generation, and validation checking. (b) Our training methodology transforms ToolMaker-generated interfaces into complete reasoning policies through systematic competence separation using SFT for structural competence and GRPO for strategic competence.
  • Figure 2: SFT training dynamics. Training and validation loss curves for Qwen2.5-7B-base and Qwen2.5-7B-Instruct with and without tool interfaces.
  • Figure 3: GRPO training dynamics for MTR-7B-Instruct. Training reward progression and response length evolution.
  • Figure 4: GRPO validation performance across benchmarks. EM and F1 scores during training.
  • Figure 5: Training stability analysis. Gradient norm trajectories and validation performance comparison.
  • ...and 3 more figures
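
The interface generation and validation steps mentioned in the Figure 1 caption can be illustrated with a small sketch. The tool below is a made-up example written in the OpenAI function-calling layout (a `function` object with `name`, `description`, and JSON-Schema `parameters`). The `lookup_entity` tool, the `validate_tool` and `validate_call` helpers, and the specific checks are assumptions for illustration, not the ToolMaker's actual validation logic.

```python
import re

# Hypothetical tool interface in the OpenAI function-calling layout.
EXAMPLE_TOOL = {
    "type": "function",
    "function": {
        "name": "lookup_entity",
        "description": "Return a short factual summary for a named entity.",
        "parameters": {
            "type": "object",
            "properties": {
                "entity": {
                    "type": "string",
                    "description": "Entity name to look up.",
                },
            },
            "required": ["entity"],
        },
    },
}

# Name rule assumed here: letters, digits, underscores, dashes, max 64 chars.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_-]{1,64}$")


def validate_tool(tool: dict) -> list[str]:
    """Return a list of structural problems; an empty list means valid."""
    errors = []
    if tool.get("type") != "function":
        errors.append("type must be 'function'")
    fn = tool.get("function")
    if not isinstance(fn, dict):
        return errors + ["missing 'function' object"]
    if not NAME_PATTERN.match(str(fn.get("name", ""))):
        errors.append("invalid function name")
    params = fn.get("parameters")
    if not isinstance(params, dict) or params.get("type") != "object":
        return errors + ["parameters must be a JSON Schema of type 'object'"]
    props = params.get("properties", {})
    for key in params.get("required", []):
        if key not in props:
            errors.append(f"required parameter '{key}' not in properties")
    return errors


def validate_call(tool: dict, arguments: dict) -> list[str]:
    """Check a proposed call's arguments against the tool's parameter schema."""
    params = tool["function"]["parameters"]
    props = params.get("properties", {})
    errors = [f"missing argument '{k}'"
              for k in params.get("required", []) if k not in arguments]
    errors += [f"unknown argument '{k}'" for k in arguments if k not in props]
    return errors
```

A check of this kind could run both when an interface is generated and when an agent emits a tool call, so that malformed interfaces or calls are rejected before any simulated observation is produced.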