SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

Tommaso Castellani; Naimeng Ye; Daksh Mittal; Thomson Yen; Hongseok Namkoong

TL;DR

SynthTools addresses the scalability and reliability bottlenecks of tool-use agent evaluation with a three-component framework (Tool Generation, Tool Simulation, and Tool Audit) that builds large, diverse synthetic tool ecosystems across 100 domains. Hierarchical domain evolution and embedding-based deduplication yield thousands of rich, non-redundant tools with realistic interfaces and failure modes, while a two-stage simulator and an LLM-based judge keep tool interactions consistent and trustworthy. Empirical evaluations show high reliability: tool simulation accuracy reaches $\ge 97\%$ on internal tests and $94\%$ alignment with ACEBench, and the audit judge attains $99\%$ accuracy with zero false positives. The framework also supports the construction of challenging downstream tasks on which state-of-the-art models struggle, providing a practical path toward large-scale training and stable evaluation of tool-use agents. Code is available at the authors’ repository.

Abstract

AI agents increasingly rely on external tools to solve complex, long-horizon tasks. Advancing such agents requires reproducible evaluation and large-scale training in controllable, diverse, and realistic tool-use environments. However, real-world APIs are limited in availability, domain coverage, and stability, often requiring access keys and imposing rate limits, which render them impractical for stable evaluation or scalable training. To address these challenges, we introduce SynthTools, a flexible and scalable framework for generating synthetic tool ecosystems. Our framework consists of three core components: Tool Generation for automatic and scalable creation of diverse tools, Tool Simulation to emulate realistic tool behaviors, and Tool Audit to ensure correctness and consistency of tool simulation. To illustrate its scalability, we show that SynthTools can readily produce toolsets that span twice as many domains and twice as many tools per domain as prior work. Furthermore, the tool simulation and tool audit components demonstrate strong reliability, achieving $94\%$ and $99\%$ accuracy respectively. Finally, we construct downstream tasks from the generated tools that even state-of-the-art models struggle to complete. By enabling scalable, diverse, and reliable tool ecosystems, SynthTools provides a practical path toward large-scale training and stable evaluation of tool-use agents. Our code is available at https://github.com/namkoong-lab/SynthTools.
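The TL;DR mentions embedding-based deduplication of generated tools. As a rough illustration of that general idea, the sketch below greedily keeps a tool description only if its cosine similarity to every previously kept description stays below a threshold. Everything here is an assumption for illustration: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the names `deduplicate` and `threshold` and the value `0.8` are hypothetical, not taken from SynthTools.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(descriptions: list[str], threshold: float = 0.8) -> list[str]:
    """Greedily keep a description only if it is dissimilar to all kept ones.

    The threshold is a hypothetical value chosen for this example.
    """
    kept: list[str] = []
    kept_vecs: list[Counter] = []
    for desc in descriptions:
        vec = embed(desc)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(desc)
            kept_vecs.append(vec)
    return kept


# Hypothetical tool descriptions; the second is a near-duplicate of the first.
tools = [
    "get current weather for a city",
    "get the current weather for a city",
    "book a flight between two airports",
]
unique_tools = deduplicate(tools)
```

In this toy run the second description is dropped as a near-duplicate of the first, while the flight-booking tool is kept. A real pipeline would presumably swap `embed` for a learned text-embedding model and tune the threshold on held-out data.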
