ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs

Hao Chen, Zhexin Hu, Jiajun Chai, Haocheng Yang, Hang He, Xiaohan Wang, Wei Lin, Luhang Wang, Guojun Yin, Zhuofeng Zhao

TL;DR

ToolForge presents a scalable synthetic-data framework for training tool-augmented LLMs on multi-hop search without real APIs, using 19 virtual tools and a (question, golden context, answer) triple to generate diverse, reflective reasoning trajectories. A Generative Interaction Modeling module creates 29 reasoning-tool interaction patterns, while a Multi-Layer Validation framework ensures data fidelity through rule-based and model-based checks with hard-negative mining. An 8B-parameter model trained solely on this synthetic data achieves state-of-the-art results across in-domain and out-of-domain benchmarks, often surpassing GPT-4o and matching larger models in tool-calling tasks. ToolForge’s plug-and-play design and sharply reduced API dependency offer a practical path to scalable, verification-aware data synthesis for tool-augmented LLMs.

Abstract

Training LLMs to invoke tools and leverage retrieved information necessitates high-quality, diverse data. However, existing pipelines for synthetic data generation often rely on tens of thousands of real API calls to enhance generalization, incurring prohibitive costs while lacking multi-hop reasoning and self-reflection. To address these limitations, we introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance by constructing only a small number of virtual tools, eliminating the need for real API calls. ToolForge leverages a (question, golden context, answer) triple to synthesize large-scale tool-learning data specifically designed for multi-hop search scenarios, further enriching the generated data through multi-hop reasoning and self-reflection mechanisms. To ensure data fidelity, we employ a Multi-Layer Validation Framework that integrates both rule-based and model-based assessments. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks. Our code and dataset are publicly available at https://github.com/Buycar-arb/ToolForge.
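
To make the (question, golden context, answer) triple and the idea of a rule-based validation layer concrete, the following is a minimal sketch, not the authors' implementation. The `Triple` class, its field names, and the `passes_rule_checks` function are hypothetical, and the specific rule (the answer string must appear in at least one golden-context passage) is an assumption chosen only for illustration; the abstract does not state which rules ToolForge applies.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Triple:
    """Hypothetical container for one (question, golden context, answer) seed."""
    question: str
    golden_context: List[str]  # passages assumed to support the answer
    answer: str


def passes_rule_checks(triple: Triple) -> bool:
    """Illustrative rule-based check (an assumption, not ToolForge's actual rules).

    Rejects triples with empty fields, then requires that the answer string
    occurs (case-insensitively) in at least one golden-context passage.
    """
    if not triple.question.strip() or not triple.answer.strip():
        return False
    if not triple.golden_context:
        return False
    answer = triple.answer.lower()
    return any(answer in passage.lower() for passage in triple.golden_context)


grounded = Triple("Who wrote X?", ["X was written by Ada."], "Ada")
ungrounded = Triple("Who wrote X?", ["X is a book."], "Ada")
print(passes_rule_checks(grounded), passes_rule_checks(ungrounded))
```

In a pipeline of this kind, a cheap check like this would plausibly run before any model-based assessment, so that obviously malformed samples are discarded without spending model calls on them.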

Paper Structure

This paper contains 31 sections, 4 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: The overall framework of ToolForge, which mainly consists of Knowledge Space Preparation (KSP), Generative Interaction Modeling (GIM), and Multi-Layer Validation (MLV).
  • Figure 2: Illustration of the four tool-calling paradigms in ToolForge. While this figure illustrates the simplest form of each tool-calling paradigm for clarity, the full ToolForge dataset features instances with far more complex logic structures.
  • Figure 3: Scenario tree with three possible outcomes for an intermediate step within a single turn: (i) correct tool-calling, (ii) tool misselection, and (iii) argument misselection. The important note shows two tool-switching cases across a two-round dialogue.
  • Figure 4: Effect of Single-hop and Multi-hop Data Ratios.
  • Figure 5: Experimental results of ablations on virtual tool design.
  • ...and 1 more figure
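
The three outcomes named in the Figure 3 caption (correct tool-calling, tool misselection, argument misselection) can be illustrated with a small classifier. This is a sketch under stated assumptions: the dictionary layout with `name` and `arguments` keys, the `classify_call` function, and the tool names `web_search` and `wiki_lookup` are all hypothetical and are not taken from the paper.

```python
from enum import Enum
from typing import Any, Dict


class Outcome(Enum):
    CORRECT = "correct tool-calling"
    TOOL_MISSELECTION = "tool misselection"
    ARGUMENT_MISSELECTION = "argument misselection"


def classify_call(expected: Dict[str, Any], actual: Dict[str, Any]) -> Outcome:
    """Compare one tool call against a reference call.

    Assumes each call is a dict with a "name" string and an "arguments" dict
    (a hypothetical layout). The tool name is checked first, so a call to the
    wrong tool is always reported as tool misselection, whatever its arguments.
    """
    if actual["name"] != expected["name"]:
        return Outcome.TOOL_MISSELECTION
    if actual["arguments"] != expected["arguments"]:
        return Outcome.ARGUMENT_MISSELECTION
    return Outcome.CORRECT


reference = {"name": "web_search", "arguments": {"query": "capital of France"}}
wrong_tool = {"name": "wiki_lookup", "arguments": {"query": "capital of France"}}
wrong_args = {"name": "web_search", "arguments": {"query": "capital of Spain"}}

for call in (reference, wrong_tool, wrong_args):
    print(classify_call(reference, call).value)
```

Exact equality on arguments is the simplest possible comparison; a real validator would likely need a looser match, for example to tolerate paraphrased search queries.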