TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments

Zhangchen Xu; Adriana Meza Soria; Shawn Tan; Anurag Roy; Ashish Sunil Agrawal; Radha Poovendran; Rameswar Panda

TL;DR

Toucan is an open-source dataset of 1.5 million tool-agentic trajectories synthesized from nearly 500 real MCP servers, addressing a gap in permissively licensed training data for LLM agents. The authors design a five-stage generation pipeline that uses five task-generation models, three teacher models for trajectory generation, and two agent frameworks, and they add three extension mechanisms to increase diversity and realism. In their experiments, Toucan-tuned models outperform comparable baselines on the BFCL V3, tau-Bench, tau2-Bench, and MCP-Universe benchmarks, with gains in tool selection, tool-execution fidelity, and multi-turn reasoning. The work addresses reproducibility and ethical considerations and outlines future plans to broaden MCP coverage, explore tool-response experts, and develop web-search-focused MCP benchmarks.

Abstract

Large Language Model (LLM) agents are rapidly emerging as powerful systems for automating tasks across domains. Yet progress in the open-source community is constrained by the lack of high-quality, permissively licensed tool-agentic training data. Existing datasets are often limited in diversity, realism, and complexity, particularly regarding multi-tool and multi-turn interactions. To address this gap, we introduce Toucan, the largest publicly available tool-agentic dataset to date, containing 1.5 million trajectories synthesized from nearly 500 real-world Model Context Protocols (MCPs). Unlike prior work, Toucan leverages authentic MCP environments to generate diverse, realistic, and challenging tasks with trajectories involving real tool execution. Our pipeline first produces a broad spectrum of tool-use queries using five distinct models, applies model-based quality filtering, and then generates agentic trajectories with three teacher models using two agentic frameworks. Rigorous rule-based and model-based validation ensures high-quality outputs. We also introduce three extension mechanisms to further diversify tasks and simulate multi-turn conversations. Models fine-tuned on Toucan outperform larger closed-source counterparts on the BFCL V3 benchmark and push the Pareto frontier forward on MCP-Universe Bench.
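The staged flow described in the abstract (task generation, quality filtering, trajectory generation, validation) can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`Task`, `Trajectory`, `stub_agent`, the model and framework labels, the score threshold, and the specific validation rule) is a hypothetical placeholder, the model-based judge is reduced to a precomputed score, and the round-robin pairing of tasks with teacher models and frameworks is an assumption made only to keep the example small.

```python
from dataclasses import dataclass, field
from itertools import cycle, product
from typing import List


@dataclass
class Task:
    query: str
    generator: str   # placeholder name of the task-generation model
    quality: float   # hypothetical judge score in [0, 1]


@dataclass
class Trajectory:
    task: Task
    teacher: str
    framework: str
    tool_calls: List[str] = field(default_factory=list)
    final_answer: str = ""


def filter_tasks(tasks, min_quality):
    """Model-based quality filtering, reduced here to a score threshold."""
    return [t for t in tasks if t.quality >= min_quality]


def generate_trajectories(tasks, teachers, frameworks, run_agent):
    """Pair each task with one (teacher, framework) combination, round-robin.

    The pairing scheme is an assumption; the source does not specify it.
    """
    combos = cycle(product(teachers, frameworks))
    return [run_agent(task, teacher, fw)
            for task, (teacher, fw) in zip(tasks, combos)]


def passes_rules(traj):
    """Hypothetical rule-based check: a tool was called and an answer exists."""
    return bool(traj.tool_calls) and bool(traj.final_answer.strip())


def stub_agent(task, teacher, framework):
    """Stand-in for a real agent run against an MCP server."""
    calls = ["search"] if "weather" in task.query else []
    return Trajectory(task, teacher, framework, calls, "done")


tasks = [
    Task("weather in Paris tomorrow", "gen-a", 0.9),
    Task("tell me a joke", "gen-b", 0.8),
    Task("weather??", "gen-c", 0.2),
]
kept = filter_tasks(tasks, min_quality=0.5)
trajectories = generate_trajectories(
    kept,
    ["teacher-1", "teacher-2", "teacher-3"],
    ["framework-x", "framework-y"],
    stub_agent,
)
valid = [t for t in trajectories if passes_rules(t)]
print(len(kept), len(trajectories), len(valid))
```

In this toy run the low-scoring task is dropped by the filter, and the trajectory with no tool call is rejected by the rule-based check. A real pipeline would replace `stub_agent` with actual tool execution and add a model-based validation pass after the rule-based one.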

