EnterpriseBench CoreCraft: Training Generalizable Agents on High-Fidelity RL Environments
Sushant Mehta, Logan Ritchie, Suhaas Garre, Nick Heiner, Edwin Chen
TL;DR
The paper argues that training AI agents in high-fidelity enterprise environments produces capabilities that generalize beyond the training distribution. It introduces CoreCraft, an enterprise-scale RL environment built around diverse tasks, expert-authored rubrics, and realistic workflows, and shows that rubric-based RL with Group Relative Policy Optimization (GRPO) and adaptive clipping improves held-out performance and transfers to external benchmarks. After one epoch of training, the task pass rate on held-out CoreCraft tasks rises by 11.39 percentage points (from 25.37% to 36.76%), with transfer gains of +4.5% on BFCL Parallel, +7.4% on Tau2-Bench Retail, and +6.8% on Tool Decathlon (Toolathlon), which the paper attributes to improved multi-step workflow execution, constraint handling, and response quality. The results suggest that environment quality, diversity, and realism are key factors in generalization, and that extended training and multi-domain curricula could further close the gap to real-world deployment.
Abstract
We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI's suite of agentic RL environments. CoreCraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on Tau2-Bench Retail, and +6.8% on Tool Decathlon (Pass@1). We believe three environment properties are consistent with the observed transfer: task-centric world building that optimizes for diverse, challenging tasks; expert-authored rubrics enabling reliable reward computation; and enterprise workflows that reflect realistic professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities.
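To make the evaluation and training signal concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical per-rollout reward equal to the fraction of rubric criteria satisfied (the abstract does not specify the reward) and uses the standard GRPO group-relative advantage, which normalizes each reward by the mean and standard deviation of its rollout group. The adaptive clipping variant is not described in the abstract and is omitted. All function names and the example rollouts are illustrative.

```python
from statistics import mean, pstdev


def task_passed(criteria_results):
    """A task counts as passed only if every rubric criterion is satisfied."""
    return all(criteria_results)


def rubric_reward(criteria_results):
    """Assumed reward (hypothetical): fraction of rubric criteria satisfied."""
    return sum(criteria_results) / len(criteria_results)


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts of one task, each judged on three criteria.
rollouts = [
    [True, True, True],
    [True, False, True],
    [False, False, True],
    [True, True, False],
]

rewards = [rubric_reward(r) for r in rollouts]
advantages = group_relative_advantages(rewards)

# Strict pass rate: only the first rollout satisfies all criteria.
pass_rate = sum(task_passed(r) for r in rollouts) / len(rollouts)

# Held-out gain reported in the abstract, in percentage points.
gain_pp = round(36.76 - 25.37, 2)
```

In this sketch, rollouts that satisfy more criteria than the group average get positive advantages and the rest get non-positive ones, while the strict pass rate counts only rollouts that satisfy every criterion, matching the all-criteria evaluation described in the abstract.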
