EnterpriseBench CoreCraft: Training Generalizable Agents on High-Fidelity RL Environments

Sushant Mehta, Logan Ritchie, Suhaas Garre, Nick Heiner, Edwin Chen

TL;DR

The paper argues that training AI agents in high-fidelity enterprise environments produces capabilities that generalize beyond the training distribution. It introduces CoreCraft, an enterprise-scale RL environment with diverse tasks, expert-authored rubrics, and realistic workflows, and shows that rubric-based RL with Group Relative Policy Optimization (GRPO) and adaptive clipping improves held-out performance and transfers to external benchmarks. After one epoch of training, GLM 4.6 gains 11.39 percentage points on held-out CoreCraft tasks (25.37% to 36.76% task pass rate), with transfer gains of +4.5% on BFCL Parallel, +7.4% on Tau2-Bench Retail, and +6.8% on Tool Decathlon (Toolathlon). The gains are attributed to improved multi-step workflow execution, constraint handling, and response quality. The results suggest that environment quality, diversity, and realism are key factors in generalization, and that extended training and multi-domain curricula may further close the gap to real-world deployment.

Abstract

We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI's suite of agentic RL environments. CoreCraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on Tau2-Bench Retail, and +6.8% on Tool Decathlon (Pass@1). We believe three environment properties are consistent with the observed transfer: task-centric world building that optimizes for diverse, challenging tasks; expert-authored rubrics enabling reliable reward computation; and enterprise workflows that reflect realistic professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities.
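The abstract's scoring and training signal can be illustrated with a small sketch. It assumes a rollout's reward is the fraction of rubric criteria it satisfies (the abstract does not specify the reward shape), counts a task as passed only when every criterion is met, and computes the standard GRPO group-relative advantage by normalizing rewards within a group of rollouts for the same task. The function names and the example rubric outcomes are hypothetical, and the paper's adaptive clipping is not modeled here.

```python
from statistics import mean, pstdev


def rubric_reward(criteria_met: list[bool]) -> float:
    """Assumed reward shape: fraction of rubric criteria satisfied."""
    return sum(criteria_met) / len(criteria_met)


def task_passed(criteria_met: list[bool]) -> bool:
    """A task passes only if all rubric criteria are satisfied."""
    return all(criteria_met)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: reward standardized within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric outcomes for four rollouts of the same task.
group = [
    [True, True, True, True],
    [True, True, False, True],
    [True, False, False, False],
    [False, False, False, False],
]

rewards = [rubric_reward(c) for c in group]
advantages = group_relative_advantages(rewards)
pass_rate = sum(task_passed(c) for c in group) / len(group)

# Held-out improvement reported in the abstract, in percentage points.
gain_pp = round(36.76 - 25.37, 2)
```

Under these assumptions, only the first rollout counts toward the pass rate, while the partially correct rollouts still receive graded advantages relative to the group mean.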

Paper Structure (52 sections, 1 equation, 1 figure, 6 tables)

Figures (1)

  • Figure 1: Architecture of the RL training loop with CoreCraft. The rollout engine generates agent responses, routing tool calls to stateful Docker containers running the MCP server. Completed trajectories are evaluated by an LLM judge against task rubrics, with rewards flowing back to the training loop.
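
The loop described in the Figure 1 caption can be sketched as a toy program. Everything here is a hypothetical stand-in: `ToyEnvironment` replaces the stateful Docker container running the MCP server, `scripted_policy` replaces the model being trained, and a rule-based `judge` replaces the LLM judge. The tool names, order data, and rubric checks are invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Stand-in for a stateful tool server; holds mutable per-episode state."""
    orders: dict = field(default_factory=lambda: {"A1": "shipped"})

    def call_tool(self, name: str, args: dict) -> str:
        if name == "get_order_status":
            return self.orders.get(args["order_id"], "unknown")
        if name == "cancel_order":
            self.orders[args["order_id"]] = "cancelled"
            return "ok"
        return "error: unknown tool"


def scripted_policy(observations: list[str]) -> dict:
    """Hypothetical agent: looks up the order once, then answers."""
    if not observations:
        return {"tool": "get_order_status", "args": {"order_id": "A1"}}
    return {"final": f"Order A1 is {observations[-1]}."}


def rollout(policy, env: ToyEnvironment, max_steps: int = 5) -> dict:
    """Rollout engine: route tool calls to the environment until a final answer."""
    observations, tool_calls = [], []
    for _ in range(max_steps):
        action = policy(observations)
        if "final" in action:
            return {"tool_calls": tool_calls, "response": action["final"]}
        tool_calls.append(action["tool"])
        observations.append(env.call_tool(action["tool"], action["args"]))
    return {"tool_calls": tool_calls, "response": ""}


def judge(trajectory: dict, rubric: list) -> float:
    """Rule-based stand-in for the LLM judge: fraction of criteria met."""
    return sum(check(trajectory) for check in rubric) / len(rubric)


# Hypothetical rubric: each criterion is a predicate over the trajectory.
rubric = [
    lambda t: "get_order_status" in t["tool_calls"],
    lambda t: "shipped" in t["response"],
    lambda t: "cancel_order" not in t["tool_calls"],
]

trajectory = rollout(scripted_policy, ToyEnvironment())
reward = judge(trajectory, rubric)  # this scalar would flow back to the trainer
```

A fresh `ToyEnvironment` per rollout mirrors the idea of isolated stateful containers, so one trajectory's tool calls cannot alter the state seen by another.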