APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Weiran Yao, Huan Wang, Silvio Savarese, Caiming Xiong

TL;DR

APIGen-MT presents a two-phase, blueprint-driven approach to synthesize high-quality multi-turn agent data using an agentic feedback loop and simulated human-agent interplay. By separating task configuration from dialogue generation, the method yields verifiable tasks and realistic interaction trajectories, enabling effective training of large and small models alike. Empirical results on $\tau$-bench and BFCL show competitive or superior performance compared to frontier models, with notable gains in multi-turn settings and improved data quality and stability. The work also contributes 5K open-source synthetic trajectories and trained xLAM-2-fc-r models to accelerate research in AI agents.

Abstract

Training effective AI agents for multi-turn interactions requires high-quality data that captures realistic human-agent dynamics, yet such data is scarce and expensive to collect manually. We introduce APIGen-MT, a two-phase framework that generates verifiable and diverse multi-turn agent data. In the first phase, our agentic pipeline produces detailed task blueprints with ground-truth actions, leveraging a committee of LLM reviewers and iterative feedback loops. These blueprints are then transformed into complete interaction trajectories through simulated human-agent interplay. We train a family of models -- the xLAM-2-fc-r series with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on $\tau$-bench and BFCL benchmarks, with the smaller models surpassing their larger counterparts, particularly in multi-turn settings, while maintaining superior consistency across multiple trials. Comprehensive experiments demonstrate that our verified blueprint-to-details approach yields high-quality training data, enabling the development of more reliable, efficient, and capable agents. We open-source 5K synthetic data trajectories and the trained xLAM-2-fc-r models to advance research in AI agents. Models at https://huggingface.co/collections/Salesforce/xlam-2-67ef5be12949d8dcdae354c4; Dataset at https://huggingface.co/datasets/Salesforce/APIGen-MT-5k and Website at https://apigen-mt.github.io
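
The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Blueprint`, `Review`, `build_blueprint`, and `simulate_trajectory` are hypothetical, the majority-vote acceptance rule is an assumption, the sentence-per-turn user is a stand-in for an LLM-simulated human, and exact matching against the ground-truth actions is a simplification of the paper's verification.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Blueprint:
    instruction: str      # what the simulated user wants
    actions: List[str]    # ground-truth actions that solve the task


@dataclass
class Review:
    passed: bool
    feedback: str = ""


def build_blueprint(
    propose: Callable[[List[str]], Blueprint],
    reviewers: List[Callable[[Blueprint], Review]],
    max_rounds: int = 3,
) -> Optional[Blueprint]:
    """Phase 1 (sketch): propose, review by committee, refine on feedback."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        blueprint = propose(feedback)
        reviews = [review(blueprint) for review in reviewers]
        # Assumption: a simple majority vote decides acceptance.
        if 2 * sum(r.passed for r in reviews) > len(reviews):
            return blueprint
        # Failed reviews become feedback for the next proposal.
        feedback = [r.feedback for r in reviews if not r.passed]
    return None


def simulate_trajectory(
    blueprint: Blueprint,
    agent: Callable[[str], List[str]],
) -> Optional[List[dict]]:
    """Phase 2 (sketch): reveal the task turn by turn, keep verified runs."""
    trajectory: List[dict] = []
    taken: List[str] = []
    # Assumption: the user reveals one sentence of the instruction per turn.
    for utterance in blueprint.instruction.split(". "):
        actions = agent(utterance)
        taken.extend(actions)
        trajectory.append({"user": utterance, "agent_actions": actions})
    # Assumption: verification is an exact match against the ground truth.
    return trajectory if taken == blueprint.actions else None
```

In this sketch a trajectory is discarded (returned as `None`) whenever the agent's actions diverge from the blueprint, which is what makes the retained data verifiable; the proposer, reviewers, and agent would each be backed by a model rather than the plain callables assumed here.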

Paper Structure

This paper contains 29 sections, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Comparative performance of larger xLAM-2-fc-r models (8B-70B, trained with APIGen-MT data) against state-of-the-art baselines on function-calling (BFCL v3 [berkeley-function-calling-leaderboard]) and agentic ($\tau$-bench [yao2024tau]) capabilities.
  • Figure 2: Overview of the APIGen-MT framework. Phase 1 generates task configurations and ground-truth actions through an agentic process with feedback loops. Phase 2 collects human-agent-environment interaction trajectories by simulating realistic conversations between a human user and a test agent in an executable environment.
  • Figure 3: Realization of the APIGen-MT framework for $\tau$-bench. We first generate realistic task instances by a random walk down the API graph and sampling. Next, the tasks are validated through a multi-stage pipeline. Instances that fail are sent back to the Generator to be refined based on the validation feedback. Finally, trajectories are generated by a simulated human user that interacts with a test agent by supplying the query details turn by turn. Trajectories that pass state- and output-based evaluations are collected.
  • Figure 4: Statistics for the dataset generated using APIGen-MT. Success rates (S.R.) are reported for the task configuration stage (with and without agentic feedback in Phase 1) and the trajectory simulation stage (Phase 2).
  • Figure 5: Density distribution of assistant and user turns in collected trajectories.
  • ...and 7 more figures