TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments

Zhangchen Xu, Adriana Meza Soria, Shawn Tan, Anurag Roy, Ashish Sunil Agrawal, Radha Poovendran, Rameswar Panda

TL;DR

Toucan delivers a first-of-its-kind, open-source dataset of 1.5 million tool-agent trajectories sourced from nearly 500 real MCP servers, addressing a critical gap in permissively licensed training data for LLM agents. The authors design a robust five-stage generation pipeline with five task-generation models, three teacher models for trajectories, and two agent frameworks, augmented by three extensions to boost diversity and realism. Experiments show Toucan-tuned models outperform comparable baselines on BFCL V3, tau-Bench, tau2-Bench, and MCP-Universe benchmarks, demonstrating improved tool selection, tool execution fidelity, and multi-turn reasoning. The work emphasizes reproducibility and ethical considerations, and outlines future plans to broaden MCP coverage, explore tool-response experts, and develop web-search focused MCP benchmarks, positioning Toucan as a scalable foundation for open-source agentic AI research.

Abstract

Large Language Model (LLM) agents are rapidly emerging as powerful systems for automating tasks across domains. Yet progress in the open-source community is constrained by the lack of high-quality, permissively licensed tool-agentic training data. Existing datasets are often limited in diversity, realism, and complexity, particularly regarding multi-tool and multi-turn interactions. To address this gap, we introduce Toucan, the largest publicly available tool-agentic dataset to date, containing 1.5 million trajectories synthesized from nearly 500 real-world Model Context Protocols (MCPs). Unlike prior work, Toucan leverages authentic MCP environments to generate diverse, realistic, and challenging tasks with trajectories involving real tool execution. Our pipeline first produces a broad spectrum of tool-use queries using five distinct models, applies model-based quality filtering, and then generates agentic trajectories with three teacher models using two agentic frameworks. Rigorous rule-based and model-based validation ensures high-quality outputs. We also introduce three extension mechanisms to further diversify tasks and simulate multi-turn conversations. Models fine-tuned on Toucan outperform larger closed-source counterparts on the BFCL V3 benchmark and push the Pareto frontier forward on MCP-Universe Bench.
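The pipeline order described in the abstract (query generation, model-based quality filtering, trajectory generation, then rule-based and model-based validation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Task`, `Trajectory`, `synthesize`, `passes_rules`, `min_quality`) is hypothetical, the models and agent frameworks are passed in as plain callables and strings, and running every task through every teacher/framework pair is an illustrative simplification rather than something the paper states.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    query: str
    generator: str        # name of the model that wrote the query (hypothetical field)
    quality: float = 0.0  # score assigned by a judge model (hypothetical field)


@dataclass
class Trajectory:
    task: Task
    teacher: str
    framework: str
    # Each tool call is a dict such as {"name": "search", "ok": True}.
    tool_calls: List[dict] = field(default_factory=list)


def passes_rules(traj: Trajectory) -> bool:
    """Rule-based check (assumed): at least one tool call, and none failed."""
    return bool(traj.tool_calls) and all(c.get("ok") for c in traj.tool_calls)


def synthesize(
    generators: Dict[str, Callable[[], List[str]]],
    judge: Callable[[str], float],
    teachers: List[str],
    frameworks: List[str],
    run_agent: Callable[[Task, str, str], Trajectory],
    model_check: Callable[[Trajectory], bool] = lambda traj: True,
    min_quality: float = 0.5,
) -> List[Trajectory]:
    kept = []
    for gen_name, generate in generators.items():
        for query in generate():
            task = Task(query=query, generator=gen_name)
            # Model-based quality filtering of the generated query.
            task.quality = judge(task.query)
            if task.quality < min_quality:
                continue
            # Trajectory generation (assumption: all teacher/framework pairs).
            for teacher in teachers:
                for framework in frameworks:
                    traj = run_agent(task, teacher, framework)
                    # Rule-based validation first, then model-based validation.
                    if passes_rules(traj) and model_check(traj):
                        kept.append(traj)
    return kept
```

In this sketch the cheap rule-based check runs before the model-based one, so a model is only asked to judge trajectories that already satisfy the structural rules. That ordering is a design choice of the example, not a detail confirmed by the source.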

Paper Structure

This paper contains 29 sections, 13 figures, 6 tables.

Figures (13)

  • Figure 1: MCP servers filtering process
  • Figure 2: The Toucan construction pipeline: A systematic five-stage process from MCP server onboarding through trajectory filtering, with three extensions for enhancing data diversity and realism.
  • Figure 3: MCP servers distribution by domain, covering a wide range of categories. Values in parentheses indicate the number of servers belonging to each category.
  • Figure 4: The figures above illustrate the Toucan dataset analysis. Subfigures (a) and (b) provide statistics on the number of servers and required tools per instance, highlighting Toucan's comprehensive coverage of multi-server and multi-tool tasks. Subfigures (c) and (d) reveal that most tasks include more tools in the context than the targeted tools, underscoring the non-trivial tool selection challenges. Subfigure (e) displays the length of user messages in tokens. Subfigures (f) and (h) demonstrate the multi-turn nature of the tasks, characterized by extended and diverse interactions among users, agents, and tools. Subfigure (g) demonstrates that Toucan encompasses both single and parallel tool calls, which enhances the dataset's versatility in capturing diverse agent-tool interaction patterns.
  • Figure 5: Toucan Subset Statistics
  • ...and 8 more figures