ScaleEnv: Scaling Environment Synthesis from Scratch for Generalist Interactive Tool-Use Agent Training

Dunwei Tu, Hongyan Hao, Hansi Yang, Yihao Chen, Yi-Kai Zhang, Zhikang Xia, Yu Yang, Yueqing Sun, Xingchen Liu, Furao Shen, Qi Gu, Hui Su, Xunliang Cai

TL;DR

ScaleEnv is a fully automated framework that synthesizes high-fidelity, interactive environments from scratch, addressing the scarcity and unreliability of existing training sandboxes. Its pipeline has two phases. Executable Graph Construction defines tool and database schemas, validates their implementations through procedural testing, and builds a Tool Dependency Graph. Task Instantiation via Graph Expansion samples seed chains, adds distractors, and performs LLM-guided controlled expansion. Together, the two phases produce verifiable tasks and rich environment states for agent training. Empirical results show strong zero-shot generalization to unseen domains on $\tau^2$-Bench and VitaBench, with performance improving as environmental diversity increases. This domain scaling curve supports the view that environmental diversity is a data-centric prerequisite for robust generalist tool-use agents. The work emphasizes safety and efficiency via rule-based rewards and execution-based verification, offering a scalable sandbox for RL research and practical deployment.

Abstract

Training generalist agents capable of adapting to diverse scenarios requires interactive environments for self-exploration. However, interactive environments remain critically scarce, and existing synthesis methods suffer from significant limitations regarding environmental diversity and scalability. To address these challenges, we introduce ScaleEnv, a framework that constructs fully interactive environments and verifiable tasks entirely from scratch. Specifically, ScaleEnv ensures environment reliability through procedural testing, and guarantees task completeness and solvability via tool dependency graph expansion and executable action verification. By enabling agents to learn through exploration within ScaleEnv, we demonstrate significant performance improvements on unseen, multi-turn tool-use benchmarks such as $\tau^2$-Bench and VitaBench, highlighting strong generalization capabilities. Furthermore, we investigate the relationship between the number of training domains and model generalization performance, providing empirical evidence that scaling environmental diversity is critical for robust agent learning.
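To make the idea of execution-based verification with a rule-based reward concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a task carries a reference action sequence and that success is judged by comparing the database state reached by the agent with the state reached by the reference. The tools `create_order` and `cancel_order`, the `State` layout, and the function names are all hypothetical.

```python
import copy
from typing import Callable, Dict, List, Tuple

State = Dict[str, Dict[str, dict]]       # table name -> row id -> row (assumed layout)
Tool = Callable[[State, dict], None]     # a tool mutates the state in place
Action = Tuple[str, dict]                # (tool name, arguments)


def create_order(state: State, args: dict) -> None:
    # Hypothetical tool: insert a pending order row.
    state["orders"][args["order_id"]] = {"item": args["item"], "status": "pending"}


def cancel_order(state: State, args: dict) -> None:
    # Hypothetical tool: raises KeyError if the order does not exist.
    state["orders"][args["order_id"]]["status"] = "cancelled"


TOOLS: Dict[str, Tool] = {"create_order": create_order, "cancel_order": cancel_order}


def execute(initial: State, actions: List[Action]) -> State:
    """Replay actions on a copy of the initial state; failed calls have no effect."""
    state = copy.deepcopy(initial)
    for name, args in actions:
        try:
            TOOLS[name](state, args)
        except KeyError:
            continue  # unknown tool, missing argument, or missing row
    return state


def rule_based_reward(initial: State, reference: List[Action], agent: List[Action]) -> float:
    """Return 1.0 if the agent's final state equals the reference final state, else 0.0."""
    return 1.0 if execute(initial, agent) == execute(initial, reference) else 0.0
```

Because the reward depends only on the final state, an agent that takes a different but equivalent route, or issues a harmless failed call along the way, still receives full credit under this sketch.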

Paper Structure

This paper contains 34 sections, 2 equations, 5 figures, and 9 tables.

Figures (5)

  • Figure 1: Overview of Executable Graph Construction. The pipeline proceeds from left to right: (1) Schema Definition for tools and databases; (2) Implementation validated via procedural testing; and (3) Tool Dependency Graph Construction to model execution logic.
  • Figure 2: Overall pipeline of Task Instantiation via Graph Expansion. The process involves: (1) Seed Chain Sampling from the dependency graph; (2) Task Initialization with verifiable execution; and (3) Controlled Environment Expansion to scale complexity while maintaining solvability.
  • Figure 3: Domain Scaling Analysis (Pass@4). Comparison of zero-shot generalization as training domains scale from $N=2$ to $16$. $N=0$ denotes the base model. Performance improves monotonically across both benchmarks.
  • Figure 4: Visualization of Tool Embeddings across Domains. We use t-SNE (van der Maaten and Hinton, 2008) to project the semantic embeddings of tools from our 16 synthesized training domains (circles) and the evaluation benchmarks (crosses and pluses). The clear spatial separation between the training clusters and the $\tau^2$-Bench / VitaBench domains empirically demonstrates the OOD nature of our evaluation.
  • Figure 5: Structural statistics of the 16 synthesized domains. The x-axis and y-axis represent the number of tools and the number of database tables, respectively. The color intensity and bubble size indicate the Graph Density of the Tool Dependency Graph, reflecting the complexity of inter-tool causal relationships within each domain.
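
The captions for Figures 2 and 5 mention seed chain sampling from the dependency graph and the Graph Density of a Tool Dependency Graph. The sketch below illustrates both on a toy graph. It is an illustration under stated assumptions, not the paper's code: the tool names are hypothetical, density is taken to be the standard directed-graph definition $|E| / (|V|(|V|-1))$, which may differ from the paper's exact metric, and seed chains are drawn by a simple random walk from a tool with no prerequisites.

```python
import random
from typing import Dict, List

# Hypothetical Tool Dependency Graph: an edge u -> v means v depends on an output of u.
GRAPH: Dict[str, List[str]] = {
    "search_flights": ["book_flight"],
    "book_flight": ["pay_booking", "cancel_booking"],
    "pay_booking": ["issue_ticket"],
    "cancel_booking": [],
    "issue_ticket": [],
}


def graph_density(graph: Dict[str, List[str]]) -> float:
    """Directed density |E| / (|V| * (|V| - 1)); assumed definition."""
    n = len(graph)
    edges = sum(len(successors) for successors in graph.values())
    return edges / (n * (n - 1)) if n > 1 else 0.0


def sample_seed_chain(graph: Dict[str, List[str]], max_len: int, rng: random.Random) -> List[str]:
    """Random walk along dependency edges, starting from a tool with no incoming edge.

    Assumes the graph has at least one such source tool.
    """
    targets = {v for successors in graph.values() for v in successors}
    sources = sorted(u for u in graph if u not in targets)
    chain = [rng.choice(sources)]
    # Extend the chain until it reaches max_len or a tool with no dependents.
    while len(chain) < max_len and graph[chain[-1]]:
        chain.append(rng.choice(graph[chain[-1]]))
    return chain


density = graph_density(GRAPH)                           # 4 edges over 5 * 4 ordered pairs
chain = sample_seed_chain(GRAPH, 3, random.Random(0))
```

Every consecutive pair in a sampled chain is a dependency edge, so a chain gives an ordered list of tool calls that could serve as the backbone of a task.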