DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use

Aili Chen, Chi Zhang, Junteng Liu, Jiangjie Chen, Chengyu Du, Yunji Li, Ming Zhong, Qin Wang, Zhengmao Zhu, Jiayuan Song, Ke Ji, Junxian He, Pengyu Zhao, Yanghua Xiao

TL;DR

DIVE is an evidence-driven recipe that inverts the usual synthesis order: it executes diverse, real-world tools first and then reverse-derives tasks strictly entailed by the resulting traces, so that tasks stay grounded by construction as diversity scales.

Abstract

Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks. Scaling diversity is difficult because training requires tasks to remain executable and verifiable, while generalization demands coverage of diverse tool types, toolset combinations, and heterogeneous tool-use patterns. We propose DIVE, an evidence-driven recipe that inverts the synthesis order: it executes diverse, real-world tools first and then reverse-derives tasks strictly entailed by the resulting traces, thereby providing grounding by construction. DIVE scales structural diversity along two controllable axes, tool-pool coverage and per-task toolset variety, and an Evidence Collection–Task Derivation loop further induces rich multi-step tool-use patterns across 373 tools in five domains. Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) yields an average gain of +22 points across 9 OOD benchmarks and outperforms the strongest 8B baseline by +68. Remarkably, a controlled scaling analysis reveals that diversity scaling consistently outperforms quantity scaling for OOD generalization, even with 4× less data.

Paper Structure

This paper contains 38 sections, 4 equations, 8 figures, and 11 tables.

Figures (8)

  • Figure 1: Motivation and overview of DIVE. Top: Fixed-toolset synthesis and pipeline pooling limit diversity and weaken generalization. Middle: Simulated tools and query-first synthesis for diverse tasks increase the risk of unverifiable or unsolvable tasks, limiting agentic training. Bottom: DIVE performs evidence-first synthesis on diverse, real-world tools, producing verifiable and executable tasks. Radar: Gray: base model; Blue: trained on deep-research data synthesized with a fixed search/browse toolset (strong in-distribution but weak/negative transfer); Purple: trained on DIVE with matched data and training budget (robust generalization).
  • Figure 2: Overview of the DIVE framework. (1) Diverse Synthesis Resource Preparation (Left): We construct decoupled pools of tools (spanning general and expert domains), seed concepts, and query-only exemplars with implicit tool-use patterns. (2) Evidence-Driven Task Synthesis (Right): We randomly sample configurations and run an inverted loop where the model executes real tools to collect grounded evidence (a, b) and reverse-derives tasks (query-answer pairs) strictly entailed by traces (c, d), ensuring validity by construction. (3) Agentic Training (Bottom): The synthesized corpus supports effective SFT cold starts and RL using verifiable reference answers.
  • Figure 3: Diversity-only vs. Quantity-only
  • Figure 4: Variety-only vs. Pool-Exp+Variety
  • Figure 5: All-Path Scaling: SFT $\to$ RL
  • ...and 3 more figures
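
The inverted loop described in the Figure 2 caption (sample a configuration, execute tools to collect evidence, then reverse-derive a query-answer pair from the trace) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the three toy tools, the seed list, and the names `collect_evidence`, `derive_task`, and `synthesize` are hypothetical, and simple deterministic functions stand in for the LLM that would drive both tool execution and task derivation in the actual pipeline.

```python
import random
from dataclasses import dataclass

# Hypothetical toy tools standing in for real, executable tools.
def city_country(city: str) -> str:
    return {"Paris": "France", "Rome": "Italy"}[city]

def city_population(city: str) -> int:
    return {"Paris": 2_100_000, "Rome": 2_800_000}[city]

def name_length(city: str) -> int:
    return len(city)

# Assumed tool pool and seed concepts (illustrative only).
TOOL_POOL = {
    "city_country": city_country,
    "city_population": city_population,
    "name_length": name_length,
}
SEEDS = ["Paris", "Rome"]

@dataclass
class Step:
    tool: str
    argument: str
    result: object

def collect_evidence(toolset, seed):
    """Execute every sampled tool first and record the call with its real output."""
    return [Step(name, seed, TOOL_POOL[name](seed)) for name in toolset]

def derive_task(trace):
    """Reverse-derive a query-answer pair whose answer is read off the trace."""
    last = trace[-1]
    return {
        "query": f"What does {last.tool} return for {last.argument!r}?",
        "answer": last.result,  # entailed by the trace, never guessed
        "tools": [step.tool for step in trace],
        "seed": last.argument,
    }

def synthesize(num_tasks, toolset_size, rng):
    """Sample a toolset and seed per task, then run evidence-first synthesis."""
    tasks = []
    for _ in range(num_tasks):
        toolset = rng.sample(sorted(TOOL_POOL), toolset_size)
        seed = rng.choice(SEEDS)
        tasks.append(derive_task(collect_evidence(toolset, seed)))
    return tasks

tasks = synthesize(num_tasks=5, toolset_size=2, rng=random.Random(0))
```

Because each answer is copied from an executed trace, re-running the recorded tool on the recorded seed reproduces it, which is the sense in which such a task is verifiable by construction. In this sketch, `toolset_size` and the size of `TOOL_POOL` loosely play the role of the per-task toolset variety and tool-pool coverage axes.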