Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation

Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson

TL;DR

Agent-Diff introduces a state-diff, code-execution benchmark for enterprise API tasks that combines ecological validity with reproducibility by sandboxing external services. It models each service as a state machine and snapshots the environment before and after a run to compute a diff $\Delta S$, enabling declarative, predicate-based verification and a closed-world invariant. Across 224 tasks and nine models, the framework reveals nuanced effects of API documentation and recovery strategies, with top models favoring proactive, information-seeking approaches and longer-horizon planning. The work provides an MIT-licensed codebase and a methodology for evaluating long-horizon, API-driven agents in realistic settings, with practical insights for improving robustness and transfer to real-world enterprise tools.

Abstract

We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world tasks that execute code via external APIs. Agentic LLM performance varies due to differences in models, external tool access, prompt structures, and agentic frameworks. Benchmarks must make fundamental trade-offs between a sandboxed approach that controls for variation in software environments and more ecologically valid approaches employing real services. Agent-Diff attempts to capture the desirable features of both of these approaches by including access to the real API interfaces for software services while sandboxing the environment in which calls are made, processed, and evaluated. This approach relies on two key innovations. The first is a novel state-diff contract, which separates process from outcome: rather than relying on fuzzy trace or parameter matching, we define task success as whether the expected change in environment state was achieved. The second is a novel sandbox that provides a standardized scripting layer that all models use to execute code against external APIs (Slack, Box, Linear, Google Calendar). Thus, we can evaluate different agentic LLMs against a standardized set of contracts using a unified sandbox while still evaluating their performance on real-world service interfaces. Using the Agent-Diff framework, we provide benchmarks for nine LLMs across 224 tasks utilizing enterprise software workflows. In addition, we evaluate the robustness of the framework with ablation experiments to assess the contribution of access to API documentation on benchmark performance. Code and data: https://github.com/agent-diff-bench/agent-diff.

Paper Structure (132 sections, 16 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: End-to-end sandbox architecture. The agent emits code (Bash/Python) that executes inside a container. All network traffic is intercepted and routed to containerized API replicas backed by per-environment PostgreSQL schemas. Entity tables are snapshotted to produce the DiffResult used for evaluation. Component details in Appendix \ref{sec:sandbox-architecture}.
  • Figure 2: Example assertion set for a two-step task. diff_type specifies which partition of the state diff to query: added = entities present in $S_1$ but not $S_0$; deleted = entities in $S_0$ but not $S_1$; updated = entities whose primary key persists but one or more field values changed. The where clause filters by field-level predicates; expected_count sets the required number of matching rows. This task contributes $m(\tau)=2$ to the denominator of the aggregate score.
  • Figure 3: Claude-Haiku-4.5 on "Organize Research Hub" ($n^* = 3$). Left: Without documentation, the model attempts the Collections API, receives null responses, and hallucinates task completion. Right: With relevant Box documentation, the model uses the correct Hub endpoints (POST /hubs, POST /hubs/{id}/manage_items) and completes the task perfectly. Full trace with agent reasoning in Appendix \ref{app:hub_traces}.
  • Figure 4: Error prevalence vs. recovery rate under no-docs (open circles) and relevant-docs (filled circles) conditions. Arrows show the shift when documentation is provided.
  • Figure 5: Analysis of recovery strategy effectiveness across model performance tiers. $\Delta = \bar{x}_{\mathrm{top}} - \bar{x}_{\mathrm{bottom}}$ is the posterior mean usage rate difference between top-performing models and bottom-performing models. Filled bars indicate strategies used significantly more by top models; open bars indicate strategies used significantly less ($P > 0.95$).
  • ...and 3 more figures
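
The state-diff evaluation described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the benchmark's implementation: the function names (`compute_diff`, `check_assertion`, `score_task`), the single-table snapshot keyed by primary key, and the equality-only `where` matching are all assumptions made for illustration. Only the field names `diff_type`, `where`, and `expected_count`, and the definitions of the added, deleted, and updated partitions, come from the caption.

```python
from typing import Any

Row = dict[str, Any]
Snapshot = dict[Any, Row]  # hypothetical layout: primary key -> row


def compute_diff(s0: Snapshot, s1: Snapshot) -> dict[str, list[Row]]:
    """Partition the change from S_0 to S_1 into added, deleted, and updated rows."""
    # added: entities present in S_1 but not in S_0
    added = [s1[k] for k in s1 if k not in s0]
    # deleted: entities present in S_0 but not in S_1
    deleted = [s0[k] for k in s0 if k not in s1]
    # updated: primary key persists but at least one field value changed
    updated = [s1[k] for k in s1 if k in s0 and s1[k] != s0[k]]
    return {"added": added, "deleted": deleted, "updated": updated}


def check_assertion(diff: dict[str, list[Row]], assertion: dict[str, Any]) -> bool:
    """Return True if the number of rows matching `where` equals `expected_count`."""
    rows = diff[assertion["diff_type"]]
    where = assertion.get("where", {})
    # Assumption: predicates are simple field equality; the real predicate
    # language may be richer.
    matches = [r for r in rows if all(r.get(f) == v for f, v in where.items())]
    return len(matches) == assertion["expected_count"]


def score_task(s0: Snapshot, s1: Snapshot, assertions: list[dict[str, Any]]) -> tuple[int, int]:
    """Return (assertions passed, total assertions) for one task."""
    diff = compute_diff(s0, s1)
    passed = sum(check_assertion(diff, a) for a in assertions)
    return passed, len(assertions)


# Hypothetical two-step task: close one issue and create another.
before = {1: {"id": 1, "title": "Fix login", "state": "todo"}}
after = {
    1: {"id": 1, "title": "Fix login", "state": "done"},
    2: {"id": 2, "title": "Write release notes", "state": "todo"},
}
assertions = [
    {"diff_type": "updated", "where": {"id": 1, "state": "done"}, "expected_count": 1},
    {"diff_type": "added", "where": {"title": "Write release notes"}, "expected_count": 1},
]
print(score_task(before, after, assertions))
```

Because the check looks only at the difference between the two snapshots, it does not depend on which API calls the agent made or in what order, which is the separation of process from outcome that the abstract describes. In this sketch the second element of the returned pair is the number of assertions for the task, which plays the role of $m(\tau)$ in the caption.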