APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong
TL;DR
APIGen is an automated, verifiable pipeline for generating diverse function-calling datasets: it harvests thousands of executable APIs, enforces a three-stage verification process (format, execution, semantic), and promotes diversity in both query styles and APIs. Models trained on APIGen data, including 1.3B and 6.7B variants, achieve competitive performance on the Berkeley Function-Calling Leaderboard (BFCL), with the 6.7B model ranking 6th and surpassing several larger models. Human evaluation and an ablation study further validate the data quality and the necessity of each verification stage. The work releases a 60k-entry dataset spanning 21 categories and 3,673 APIs, showing that high-quality synthetic data can empower smaller models on tool-use and function-calling tasks, and it outlines future extensions to broader APIs and multi-turn interactions.
Abstract
The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each entry in our dataset is verified through three hierarchical stages: format checking, actual function execution, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agents. The dataset is available on Huggingface: https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k and the project homepage: https://apigen-pipeline.github.io/
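The three hierarchical verification stages described above can be sketched as a simple filtering pipeline. This is a minimal illustration, not the authors' implementation: the registry of executable functions and the LLM-judge callable (`judge`) are hypothetical stand-ins for the paper's actual API collection and semantic-verification step.

```python
import json


def format_check(raw: str):
    """Stage 1: the generated call must parse as JSON with the required fields."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
        return None
    return call


def execution_check(call: dict, api_registry: dict):
    """Stage 2: execute the call against a real function and capture its result.

    Returns None on any failure (unknown API, bad arguments, runtime error).
    """
    fn = api_registry.get(call["name"])
    if fn is None:
        return None
    try:
        return fn(**call["arguments"])
    except Exception:
        return None


def semantic_check(query: str, call: dict, result, judge) -> bool:
    """Stage 3: ask a judge (e.g. an LLM) whether the execution result
    actually satisfies the user's query."""
    return judge(query, call, result)


def verify(raw: str, query: str, api_registry: dict, judge) -> bool:
    """Run all three stages; a data point is kept only if every stage passes."""
    call = format_check(raw)
    if call is None:
        return False
    result = execution_check(call, api_registry)
    if result is None:
        return False
    return semantic_check(query, call, result, judge)
```

Only entries that survive all three filters would enter the final dataset, which is how the pipeline trades raw generation volume for verified quality.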
