SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents

Subhrangshu Nandi, Arghya Datta, Rohith Nama, Udita Patel, Nikhil Vichare, Indranil Bhattacharya, Prince Grover, Shivam Asija, Giuseppe Carenini, Wei Zhang, Arushi Gupta, Sreyoshi Bhaduri, Jing Xu, Huzefa Raja, Shayan Ray, Aaron Chan, Esther Xu Fei, Gaoyuan Du, Zuhaib Akhtar, Harshita Asnani, Weian Chan, Ming Xiong, Francesco Carbone, Jeetu Mirchandani

TL;DR

SOP-Bench is a benchmark of 2,000+ tasks drawn from human expert-authored SOPs across 12 business domains. Built with a human-AI collaborative framework, it enables researchers and practitioners to systematically investigate agent design choices, model selection, and deployment strategies across diverse procedural tasks.

Abstract

LLM-based agents struggle to execute complex, multi-step Standard Operating Procedures (SOPs) that are fundamental to industrial automation. Existing benchmarks fail to capture the procedural complexity and tool orchestration demands of real-world workflows. We introduce SOP-Bench, a benchmark of 2,000+ tasks from human expert-authored SOPs across 12 business domains (healthcare, logistics, finance, content moderation, etc.). Using a human-AI collaborative framework, experts crafted authentic SOPs while AI generated the supporting artifacts (tools, APIs, datasets), all of which were human-validated, yielding realistic tasks with executable interfaces and ground-truth outputs. SOP-Bench serves as a research enabler for systematically investigating agent architectures, model capabilities, and deployment considerations across diverse procedural tasks. We demonstrate its utility through illustrative experiments with a subset of frontier models across Function-Calling (FC) and ReAct agents, revealing critical insights. For example, (1) newer models do not guarantee better performance: the Claude 4 family outperforms the Claude 4.5 family on ReAct tasks (Claude 4 Opus: 72.4% vs. Claude 4.5 Sonnet: 63.3% task success rate), demonstrating that production upgrades require validation; (2) no single model-agent combination dominates: best performances range from 57% to 100% depending on domain. These examples illustrate how SOP-Bench makes it possible to isolate and study specific dimensions of agent performance without costly production experiments. Our goal is not to rank model capabilities or build optimal agents, but to provide a rigorous evaluation framework that enables researchers and practitioners to systematically investigate agent design choices, model selection, and deployment strategies. We release the benchmark at https://github.com/amazon-science/sop-bench.
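
To make the per-domain comparison in the abstract concrete, the following is a minimal sketch of how a task success rate could be tabulated per domain and per model-agent combination, and how the best combination per domain could then be picked out. It assumes that a task counts as successful when the agent's final output exactly matches the ground-truth output; that scoring rule, the record layout, the names `task_success_rate` and `best_per_domain`, and all values in `records` are hypothetical and are not taken from the SOP-Bench codebase or its results.

```python
from collections import defaultdict

# Hypothetical toy records: (domain, model-agent combination, agent output, ground truth).
# These values are illustrative only; they are NOT results from SOP-Bench.
records = [
    ("healthcare", "model_a+react", "approve", "approve"),
    ("healthcare", "model_a+react", "reject", "approve"),
    ("healthcare", "model_b+fc", "approve", "approve"),
    ("healthcare", "model_b+fc", "approve", "approve"),
    ("logistics", "model_a+react", "ship", "ship"),
    ("logistics", "model_a+react", "hold", "hold"),
    ("logistics", "model_b+fc", "ship", "hold"),
    ("logistics", "model_b+fc", "hold", "hold"),
]


def task_success_rate(records):
    """Map (domain, combo) to the fraction of tasks whose output equals the ground truth.

    Exact-match scoring is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for domain, combo, output, truth in records:
        totals[(domain, combo)] += 1
        hits[(domain, combo)] += int(output == truth)
    return {key: hits[key] / totals[key] for key in totals}


def best_per_domain(tsr):
    """Map each domain to its highest-scoring (combo, rate) pair."""
    best = {}
    for (domain, combo), rate in tsr.items():
        if domain not in best or rate > best[domain][1]:
            best[domain] = (combo, rate)
    return best


tsr = task_success_rate(records)
best = best_per_domain(tsr)
for domain, (combo, rate) in sorted(best.items()):
    print(f"{domain}: best={combo} TSR={rate:.0%}")
```

In this toy data a different combination wins in each domain, which is the kind of pattern the abstract describes when it says that no single model-agent combination dominates.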

Paper Structure

This paper contains 55 sections, 6 figures, 20 tables.

Figures (6)

  • Figure 1: SOP-Bench evaluation overview. Realistic business process SOPs authored by human experts across diverse domains are converted into executable task instances with structured tool/API interfaces and ground-truth outputs. LLM agents execute tasks via reproducible tool interactions, producing execution trajectories evaluated using grounded, outcome-aware metrics (ECR, C-TSR, TSR).
  • Figure 2: Example of instructions that can be simple for humans but confusing for LLMs. In this excerpt from page 2 of the UNHTS-SOP (Appendix \ref{appendix:real_sop}), steps 4 and 6 both instruct the reader to verify the insurance, but neither states explicitly how to verify it. Moreover, it is not clear why there are two insurance verification steps. For human associates with implicit knowledge of the domain this may not be ambiguous, but our experiments indicate that it is for LLMs. In our experience, this is a common occurrence in business process SOPs.
  • Figure 3: SOP-Bench Generation Workflow: Human experts author SOPs and validate all artifacts; AI generates tools, APIs, and datasets. ** denotes post-generation complexity introduction.
  • Figure 4: Sample API and ToolSpec generated for the Patient Intake SOP in the healthcare industry.
  • Figure 5: Detailed sample of the Business Task and Task Context for the Patient Intake SOP in the healthcare industry.
  • ...and 1 more figure