SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures

Jiaming Wang, Zhe Tang, Yilin Jin, Peng Ding, Xiaoyu Li, Xuezhi Cao

TL;DR

SOP-Maze is a benchmark built from real-world business data for evaluating large language models on business standard operating procedures (SOPs), a setting that prior instruction-following benchmarks leave largely unexplored. It assembles 397 tasks across 23 scenarios, split into LRS (breadth) and HRS (depth) tasks, with a four-component task format and a JSON Schema-based evaluation that uses reference indices. Across 18 state-of-the-art models, results show substantial difficulty in adhering to complex SOPs. The analysis identifies three main error modes: route blindness, conversational fragility, and calculation errors. Reasoning-enabled models outperform non-reasoning ones in most scenarios. Ablation studies show that simplifying context, dialogue, and calculations mitigates some failures, but many challenges persist, pointing to where instruction following in business domains still needs to improve. The dataset and code are open-sourced to support further research.

Abstract

As large language models (LLMs) are widely deployed as domain-specific agents, many benchmarks have been proposed to evaluate their ability to follow instructions and make decisions in real-world scenarios. However, business scenarios often involve complex standard operating procedures (SOPs), and the evaluation of LLM capabilities in such contexts has not been fully explored. To bridge this gap, we propose SOP-Maze, a benchmark constructed from real-world business data and adapted into a collection of 397 tasks from 23 complex SOP scenarios. We further categorize SOP tasks into two broad classes: Lateral Root System (LRS), representing wide-option tasks that demand precise selection; and Heart Root System (HRS), which emphasizes deep logical reasoning with complex branches. Extensive experiments reveal that nearly all state-of-the-art models struggle with SOP-Maze. We conduct a comprehensive analysis and identify three key error categories: (i) route blindness: difficulty following procedures; (ii) conversational fragility: inability to handle real dialogue nuances; and (iii) calculation errors: mistakes in time or arithmetic reasoning under complex contexts. This systematic study explores LLM performance across SOP tasks that challenge both breadth and depth, offering new insights for improving model capabilities. We have open-sourced our work at https://github.com/ADoublLEN/SOP-Maze.

Paper Structure

This paper contains 37 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: An example of business SOPs.
  • Figure 2: Illustration of SOP-Maze. Based on the context and characteristics of the SOPs, business SOP tasks are categorized into two types, LRS and HRS. Each task prompt comprises 4 key components: Objective, Standard Operating Procedures, User Input and Output Requirement. After the LLM generates an output, it is assessed using JSON Schema based Evaluation.
  • Figure 3: Task distribution.
  • Figure 4: Reasoning models outperform non-reasoning models in most scenarios within SOP-Maze.
  • Figure 5: Experiment of Route Blindness Ablation on "Bulk Order Clarification".
  • ...and 2 more figures
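
The pipeline described in the Figure 2 caption (a four-component prompt, then a JSON Schema based check of the model output) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the `SopTask` class, the `answer_index` field, the `SCHEMA` dictionary, and the exact-match scoring against a reference index are all assumptions made for the example, and the hand-rolled type check stands in for a real JSON Schema validator.

```python
import json
from dataclasses import dataclass


@dataclass
class SopTask:
    """The four prompt components named in the Figure 2 caption (class name is hypothetical)."""
    objective: str
    sop: str
    user_input: str
    output_requirement: str

    def to_prompt(self) -> str:
        # Join the four components into one prompt string.
        return "\n\n".join([
            f"Objective:\n{self.objective}",
            f"Standard Operating Procedures:\n{self.sop}",
            f"User Input:\n{self.user_input}",
            f"Output Requirement:\n{self.output_requirement}",
        ])


# Hypothetical schema: one integer field naming the chosen SOP option.
SCHEMA = {"required": {"answer_index": int}}


def parse_output(raw: str, schema: dict = SCHEMA):
    """Return the parsed object if `raw` is JSON matching the schema, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for key, expected_type in schema["required"].items():
        value = obj.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return None
    return obj


def is_correct(raw: str, reference_index: int) -> bool:
    """Assumed scoring rule: output must be well-formed and match the reference index."""
    obj = parse_output(raw)
    return obj is not None and obj["answer_index"] == reference_index
```

Under these assumptions, a malformed reply such as free text or a JSON array counts as a failure just like a wrong index does, which is one plausible way a format-based check could penalize outputs that ignore the Output Requirement.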