ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset

Chen Yang, Ran Le, Yun Xing, Zhenwei An, Zongchao Chen, Wayne Xin Zhao, Yang Song, Tao Zhang

TL;DR

ToolMind addresses the scarcity of high-quality tool-use data for LLMs with a large-scale, reasoning-enhanced dataset. It synthesizes 160k samples through graph-based function sampling and multi-agent simulation of user, assistant, and tool roles, adds 200k augmented instances from open-source data (360k in total), and applies two-stage quality filtering at the trajectory level and the turn level to retain high-quality reasoning traces. After supervised fine-tuning on ToolMind, models show consistent gains on BFCL-v4, $\tau$-bench, and $\tau^2$-bench, and ablations confirm the contribution of graph sampling, turn-level filtering, and augmentation. The results indicate that large-scale synthetic data with realistic tool-use dynamics can meaningfully improve LLM tool-use capabilities, and the dataset is offered as a resource for future research.

Abstract

Large Language Model (LLM) agents have developed rapidly in recent years to solve complex real-world problems using external tools. However, the scarcity of high-quality trajectories still hinders the development of stronger LLM agents. Most existing works on multi-turn dialogue synthesis validate correctness only at the trajectory level, which may overlook turn-level errors that can propagate during training and degrade model performance. To address these limitations, we introduce ToolMind, a large-scale, high-quality tool-agentic dataset with 160k synthetic data instances generated using over 20k tools and 200k augmented open-source data instances. Our data synthesis pipeline first constructs a function graph based on parameter correlations and then uses a multi-agent framework to simulate realistic user-assistant-tool interactions. Beyond trajectory-level validation, we employ fine-grained turn-level filtering to remove erroneous or suboptimal steps, ensuring that only high-quality reasoning traces are retained. This approach mitigates error amplification during training while preserving self-corrective reasoning signals essential for robust tool-use learning. Models fine-tuned on ToolMind show significant improvements over baselines on several benchmarks.
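The distinction the abstract draws between trajectory-level validation and turn-level filtering can be illustrated with a minimal sketch. Everything named here is hypothetical: the `Turn` and `Trajectory` classes, the `ok` and `goal_reached` flags (standing in for whatever verdicts a judge would produce), and the choice to keep flagged turns as context while excluding them from the training targets. That last choice is only one possible reading of how self-corrective signals could be preserved; this excerpt does not specify the authors' actual mechanism.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str        # "user", "assistant", or "tool"
    content: str
    ok: bool = True  # hypothetical per-turn verdict, e.g. from a judge model

@dataclass
class Trajectory:
    turns: list
    goal_reached: bool  # hypothetical trajectory-level verdict

def trajectory_filter(trajectories):
    """Trajectory-level stage: drop whole dialogues that fail the global check."""
    return [t for t in trajectories if t.goal_reached]

def turn_filter(trajectory):
    """Turn-level stage: return indices of assistant turns kept as training targets.

    Flagged turns are not deleted here; they stay in the context so that a
    later corrected turn still has an error to respond to (an assumption).
    """
    return [
        i for i, turn in enumerate(trajectory.turns)
        if turn.role == "assistant" and turn.ok
    ]

good = Trajectory(
    turns=[
        Turn("user", "Cancel my booking B42."),
        Turn("assistant", "call cancel_booking(id='B42')", ok=False),  # wrong parameter name
        Turn("tool", "error: unknown parameter 'id'"),
        Turn("assistant", "call cancel_booking(booking_id='B42')"),    # self-correction
        Turn("tool", "{'refund_amount': 120}"),
        Turn("assistant", "Your booking is cancelled; the refund is 120."),
    ],
    goal_reached=True,
)
bad = Trajectory(
    turns=[Turn("user", "Book a flight."), Turn("assistant", "Done.", ok=False)],
    goal_reached=False,
)

kept = trajectory_filter([good, bad])
targets = [turn_filter(t) for t in kept]
print(targets)  # [[3, 5]]
```

In this toy case a trajectory-only check would accept every assistant turn of the first dialogue, including the faulty call at index 1; the turn-level stage excludes that call as a target while the corrected call at index 3 is retained.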

Paper Structure

This paper contains 25 sections, 4 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Performance on BFCL-v4, $\tau$-bench, and $\tau^2$-bench.
  • Figure 2: An illustration of the proposed data synthesis pipeline, with three key components: (1) Graph Construction and Function Chain Sampling, including function collection and refinement, function-graph construction, and random-walk sampling; (2) Multi-Agent Multi-Turn Trajectory Synthesis, using language models to simulate multiple roles and generate interactive trajectories; and (3) Quality Filtering, applying both trajectory-level and turn-level filtering to ensure data quality.
  • Figure 3: A distribution analysis of the proposed synthetic dataset.
  • Figure 4: Distribution of user intent domains, with tail ($\le 2\%$) grouped as "others".
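The first component named in the Figure 2 caption, function-graph construction followed by random-walk sampling, can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the four tool specs are invented, and "parameter correlation" is approximated by the simplest possible rule, an edge from f to g whenever an output field of f shares its name with an input parameter of g. The correlation measure the paper actually uses is not given in this excerpt.

```python
import random
from collections import defaultdict

# Hypothetical tool specs: input parameter names and output field names.
TOOLS = {
    "search_flights": {"inputs": {"origin", "destination", "date"},
                       "outputs": {"flight_id", "price"}},
    "book_flight": {"inputs": {"flight_id", "passenger_name"},
                    "outputs": {"booking_id"}},
    "cancel_booking": {"inputs": {"booking_id"},
                       "outputs": {"refund_amount"}},
    "get_weather": {"inputs": {"city", "date"},
                    "outputs": {"forecast"}},
}

def build_function_graph(tools):
    """Add an edge f -> g when an output of f shares a name with an input of g."""
    graph = defaultdict(list)
    for f, f_spec in tools.items():
        for g, g_spec in tools.items():
            if f != g and f_spec["outputs"] & g_spec["inputs"]:
                graph[f].append(g)
    return graph

def sample_chain(graph, start, max_len, rng):
    """Random walk from `start`; stops at a dead end or after max_len functions."""
    chain = [start]
    while len(chain) < max_len:
        neighbors = graph.get(chain[-1], [])
        if not neighbors:
            break
        chain.append(rng.choice(neighbors))
    return chain

graph = build_function_graph(TOOLS)
chain = sample_chain(graph, "search_flights", max_len=3, rng=random.Random(0))
print(chain)  # ['search_flights', 'book_flight', 'cancel_booking']
```

Each node in this toy graph has at most one successor, so the walk is deterministic regardless of the seed. A sampled chain like this one would then serve as the backbone of a simulated multi-turn dialogue, with isolated functions such as `get_weather` yielding single-call chains.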