SupChain-Bench: Benchmarking Large Language Models for Real-World Supply Chain Management

Shengyue Guan; Yihao Liu; Lang Cao

TL;DR

SupChain-Bench evaluates LLMs on real-world supply chain management by jointly assessing domain knowledge and long-horizon tool-based orchestration. It is a dual benchmark with a Knowledge QA component and a Function-Calling component set in a simulated SOP-governed environment, accompanied by SupChain-ReAct, an SOP-free approach that derives procedural guidance from reasoning. The results reveal substantial gaps in execution reliability across models and show that incorporating domain context or autonomous procedural synthesis can markedly boost tool-use performance. Overall, the work provides a principled, domain-specific benchmark for studying reliable long-horizon orchestration in operational settings and highlights considerable room for improvement in LLM-based supply chain agents.

Abstract

Large language models (LLMs) have shown promise in complex reasoning and tool-based decision making, motivating their application to real-world supply chain management. However, supply chain workflows require reliable long-horizon, multi-step orchestration grounded in domain-specific procedures, which remains challenging for current models. To systematically evaluate LLM performance in this setting, we introduce SupChain-Bench, a unified real-world benchmark that assesses both supply chain domain knowledge and long-horizon tool-based orchestration grounded in standard operating procedures (SOPs). Our experiments reveal substantial gaps in execution reliability across models. We further propose SupChain-ReAct, an SOP-free framework that autonomously synthesizes executable procedures for tool use, achieving the strongest and most consistent tool-calling performance. Our work establishes a principled benchmark for studying reliable long-horizon orchestration in real-world operational settings and highlights significant room for improvement in LLM-based supply chain agents.

Paper Structure (39 sections, 8 figures, 14 tables)

This paper contains 39 sections, 8 figures, 14 tables.

Introduction
Related Work
Benchmark Design
QA Dataset Curation
Function Calling Dataset Construction
Experiments
Settings
Main Results
Enhancing LLM with Supply Chain Context
Long-Horizon Tool-Use Persistence Across LLMs
Complexity-stratified Accuracy and Error Modes
Efficiency Analysis
SupChain-ReAct
Conclusion
Ethical Considerations
...and 24 more sections

Figures (8)

Figure 1: Overall composition of SupChain-Bench. The figure shows the distribution of annotated samples across three major functional domains of supply chain management: Logistics Collaboration & Cross-Border, Fulfillment & Warehouse Operations, and Finance, Planning & Customs. Each domain is further decomposed into its constituent sub-tasks, highlighting the relative proportions of different operational activities represented in the dataset.
Figure 2: The dataset construction pipeline of SupChain-Bench follows a four-stage quality-assurance process. First, supply chain documents are curated and sanitized to construct a structured knowledge base. Second, a multi-agent LLM framework generates diverse QA candidates across multiple question formats. Third, these candidates undergo model-driven refinement to improve clarity, consistency, and factual correctness. Finally, human annotators perform strict verification, and only QA pairs that receive unanimous approval are included in the benchmark.
Figure 3: Distribution of function-calling questions by tool complexity in the dataset. Each point corresponds to a unique question, where the x-axis denotes the number of execution steps required and the y-axis indicates the number of distinct tools invoked.
Figure 4: Overlaid histograms compare the number of tool calls per task for four models (gpt5, gemini-2.5-pro, qwen3-max, and claude-4-sonnet) under No SOP (blue) versus SOP (red). Across models, introducing an SOP generally shifts the distribution toward higher tool-call counts and produces heavier right tails, indicating longer and more persistent tool-use chains, while the no-SOP setting more often concentrates on shorter sequences and shows more early termination.
Figure 5: Question generation prompt (part 1 of 2).
...and 3 more figures
