SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

Tommaso Castellani, Naimeng Ye, Daksh Mittal, Thomson Yen, Hongseok Namkoong

TL;DR

SynthTools tackles the scalability and reliability bottlenecks of tool-use agent evaluation with a three-component framework (Tool Generation, Tool Simulation, and Tool Audit) that builds large, diverse synthetic tool ecosystems across 100 domains. Its hierarchical domain evolution and embedding-based deduplication yield thousands of rich, non-redundant tools with realistic interfaces and failure modes, while a two-stage simulator and an LLM-based judge keep tool interactions consistent and trustworthy. Empirical evaluations show high reliability: tool simulation accuracy reaches $\ge 97\%$ on internal tests and $94\%$ alignment with ACEBench, and the audit judge attains $99\%$ accuracy with zero false positives. The framework also supports the creation of challenging downstream tasks on which state-of-the-art models struggle, providing a practical path toward large-scale training and stable evaluation of tool-use agents. Code is available at the authors’ repository.

Abstract

AI agents increasingly rely on external tools to solve complex, long-horizon tasks. Advancing such agents requires reproducible evaluation and large-scale training in controllable, diverse, and realistic tool-use environments. However, real-world APIs are limited in availability, domain coverage, and stability, often requiring access keys and imposing rate limits, which render them impractical for stable evaluation or scalable training. To address these challenges, we introduce SynthTools, a flexible and scalable framework for generating synthetic tool ecosystems. Our framework consists of three core components: Tool Generation for automatic and scalable creation of diverse tools, Tool Simulation to emulate realistic tool behaviors, and Tool Audit to ensure correctness and consistency of tool simulation. To illustrate its scalability, we show that SynthTools can readily produce toolsets that span twice as many domains and twice as many tools per domain as prior work. Furthermore, the tool simulation and tool audit components demonstrate strong reliability, achieving $94\%$ and $99\%$ accuracy respectively. Finally, we construct downstream tasks from the generated tools that even state-of-the-art models struggle to complete. By enabling scalable, diverse, and reliable tool ecosystems, SynthTools provides a practical path toward large-scale training and stable evaluation of tool-use agents. Our code is available at https://github.com/namkoong-lab/SynthTools.
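
The TL;DR mentions embedding-based deduplication of generated tools without giving details. The following is a minimal sketch of one common way such a step can work, not the authors' implementation: a greedy filter that keeps a tool only if its embedding is not too close, by cosine similarity, to any tool already kept. The function names, the toy three-dimensional vectors, and the `0.9` threshold are all hypothetical; in practice the vectors would come from a text-embedding model applied to each tool's description.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate_tools(
    embeddings: Dict[str, Sequence[float]],
    threshold: float = 0.9,  # hypothetical cutoff, not taken from the paper
) -> List[str]:
    """Greedily keep tools whose embedding is below `threshold` similarity
    to every tool kept so far; later near-duplicates are dropped."""
    kept: List[str] = []
    for name, vec in embeddings.items():
        if all(cosine_similarity(vec, embeddings[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy embeddings (illustrative only): the first two tools are near-duplicates.
toy_embeddings = {
    "get_stock_quote": [1.0, 0.0, 0.0],
    "fetch_stock_price": [0.99, 0.05, 0.0],
    "place_limit_order": [0.0, 1.0, 0.0],
}
kept_tools = deduplicate_tools(toy_embeddings, threshold=0.9)
print(kept_tools)
```

In this sketch the order of iteration decides which member of a near-duplicate pair survives. A real pipeline might instead cluster the embeddings or keep the tool with the richer specification.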

Paper Structure

This paper contains 27 sections, 6 equations, 13 figures, 4 tables, 1 algorithm.

Figures (13)

  • Figure 1: Scalability, reliability, and diversity of SynthTools. Top Left: SynthTools achieves substantially higher scalability than existing benchmarks, both in the number of fields and in the maximum number of tools supported per field. Note that Sum denotes the sum of all other benchmarks, discounting duplicate fields. Top Right: Both our generated tools and the LLM judge built for tool audit are highly reliable. Bottom Left: SynthTools covers more, finer-grained domains. Bottom Right: SynthTools produces more tools per field, with richer and more nuanced capabilities.
  • Figure 2: An example of a tool generated through our pipeline for the Financial Trading field.
  • Figure 3: An example of tool generation through the hierarchical domain evolution procedure.
  • Figure 4: Distribution of tool embeddings across diverse fields.
  • Figure 5: Scaling the number of tools within a field (e-commerce and retail): As we scale the number of tools within a field, the tools become more diverse rather than duplicating one another. (left: 1 sub-domain, 110 tools; center: 3 sub-domains, 315 tools; right: 9 sub-domains, 933 tools)
  • ...and 8 more figures