
EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

Shiva Krishna Reddy Malay, Shravan Nayak, Jishnu Sethumadhavan Nair, Sagar Davasam, Aman Tiwari, Sathwik Tejaswi Madhusudhan, Sridhar Krishna Nemala, Srinivas Sunkara, Sai Rajeswar

Abstract

Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically, the need for long-horizon planning amidst persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. Specifically, EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations in state-of-the-art models: the top-performing Claude Opus 4.5 achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (best model achieves 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed to advance the robustness of agentic planning in professional workflows.

Paper Structure (66 sections, 7 figures, 5 tables)

This paper contains 66 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Performance–cost tradeoff for agentic tool use on EnterpriseOps-Gym. We plot task success rate against estimated cost per task for both closed-source and open-source models. Open-source models incur lower cost but achieve consistently lower success rates. While higher-cost models offer modest performance gains, they remain far below reliable task completion.
  • Figure 2: Overview of EnterpriseOps-Gym: A benchmark for stateful agentic planning and tool use. (Top-left) EnterpriseOps-Gym spans eight enterprise domains and evaluates multi-step agentic planning, policy adherence, state-driven tool calling, and cross-domain orchestration in a reproducible sandbox. (Top-right) Domain experts create the sandbox, author realistic single- and cross-domain tasks, execute ground-truth trajectories, and write outcome-based verification logic with multi-stage quality assurance, along with a human-written oracle plan for completing each task. (Bottom) Given a task and constraints (as system-level policies), agents interact with the environment and execute tools. They are evaluated by final-state verifiers that check goal completion, policy compliance, and side effects. In the example shown, the agent fails to adhere to the system policy for linking case knowledge, which mandates setting a parameter to "suggested" when a knowledge base is automatically discovered. Furthermore, the agent fails to send an email notification correctly because of an unresolved identifier for the given case.
  • Figure 3: Task distribution across eight EnterpriseOps-Gym domains.
  • Figure 4: Performance degrades consistently with planning horizon. Pass@1 accuracy for closed-source (solid) and open-weight (dashed) models across horizon lengths 4–16. Thick lines show the group mean $\pm$1 SE. We observe monotonic degradation of performance for both groups, while open-weight model performance falls more sharply with horizon length.
  • Figure 5: Impact of thinking budget on performance. Histograms show the performance of the GPT-OSS-120B model (OpenAI, 2025) across domains at different thinking budgets. The model performs poorly with a low thinking budget, and its performance increases steadily as the budget grows.
  • ...and 2 more figures
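
The "group mean $\pm$1 SE" summary in the Figure 4 caption can be illustrated with a minimal sketch. The caption does not say how the standard error is computed, so this assumes the usual definition: the sample standard deviation across models in a group, divided by the square root of the number of models. The function name `group_mean_se` and the input values are hypothetical placeholders, not results from the paper.

```python
import math
from statistics import mean, stdev


def group_mean_se(scores):
    """Return (mean, standard error) for per-model Pass@1 scores.

    Assumption: SE = sample standard deviation / sqrt(n), taken across
    the models in one group at a single horizon length.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    m = mean(scores)
    # With a single model there is no spread to estimate, so report 0.
    se = stdev(scores) / math.sqrt(n) if n > 1 else 0.0
    return m, se


# Hypothetical placeholder scores for one group at two horizon lengths
# (illustrative only, NOT values reported in the paper).
placeholder_scores = {4: [0.5, 0.4, 0.6], 16: [0.2, 0.1, 0.3]}

for horizon, scores in placeholder_scores.items():
    m, se = group_mean_se(scores)
    print(f"horizon={horizon}: mean={m:.3f}, SE={se:.3f}")
```

Plotting the mean with a band of one SE on either side, per horizon length and per model group, would give thick summary lines of the kind the caption describes.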