SupChain-Bench: Benchmarking Large Language Models for Real-World Supply Chain Management
Shengyue Guan, Yihao Liu, Lang Cao
TL;DR
SupChain-Bench evaluates LLMs on real-world supply chain management by jointly assessing domain knowledge and long-horizon, tool-based orchestration. It pairs a Knowledge QA component with a Function-Calling component set in a simulated environment governed by standard operating procedures (SOPs), and adds SupChain-ReAct, an SOP-free approach that derives procedural guidance from the model's own reasoning. The results reveal substantial gaps in execution reliability across models and show that incorporating domain context or autonomous procedural synthesis can markedly improve tool-use performance. Overall, the work provides a principled, domain-specific benchmark for studying reliable long-horizon orchestration in operational settings and highlights how much room remains for improving LLM-based supply chain agents.
Abstract
Large language models (LLMs) have shown promise in complex reasoning and tool-based decision making, motivating their application to real-world supply chain management. However, supply chain workflows require reliable long-horizon, multi-step orchestration grounded in domain-specific procedures, which remains challenging for current models. To systematically evaluate LLM performance in this setting, we introduce SupChain-Bench, a unified real-world benchmark that assesses both supply chain domain knowledge and long-horizon tool-based orchestration grounded in standard operating procedures (SOPs). Our experiments reveal substantial gaps in execution reliability across models. We further propose SupChain-ReAct, an SOP-free framework that autonomously synthesizes executable procedures for tool use, achieving the strongest and most consistent tool-calling performance. Our work establishes a principled benchmark for studying reliable long-horizon orchestration in real-world operational settings and highlights significant room for improvement in LLM-based supply chain agents.
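To make "tool-based orchestration grounded in SOPs" concrete, the following is a minimal sketch of how a tool-call trace might be checked against an SOP. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the tool names (`get_order`, `check_inventory`, `create_shipment`, `get_carrier_rates`), the two helper functions, and the idea of modelling an SOP as an ordered list of required calls are all hypothetical.

```python
from typing import List


def follows_sop(trace: List[str], sop_steps: List[str]) -> bool:
    """Return True if every SOP step appears in the trace in the required order.

    Extra tool calls between required steps are tolerated (assumption).
    """
    remaining = iter(trace)
    # `step in remaining` consumes the iterator up to the first match,
    # so later steps can only match calls made after earlier ones.
    return all(step in remaining for step in sop_steps)


def step_completion(trace: List[str], sop_steps: List[str]) -> float:
    """Fraction of SOP steps matched in order before the first missing step."""
    if not sop_steps:
        return 1.0
    pos = 0
    matched = 0
    for step in sop_steps:
        try:
            pos = trace.index(step, pos) + 1
        except ValueError:
            break  # this step was never called after the previous one
        matched += 1
    return matched / len(sop_steps)


# Hypothetical SOP for fulfilling an order, and two hypothetical agent traces.
sop = ["get_order", "check_inventory", "create_shipment"]
trace_ok = ["get_order", "check_inventory", "get_carrier_rates", "create_shipment"]
trace_bad = ["get_order", "create_shipment", "check_inventory"]

print(follows_sop(trace_ok, sop), step_completion(trace_ok, sop))
print(follows_sop(trace_bad, sop), step_completion(trace_bad, sop))
```

In this sketch the second trace fails because the shipment is created before inventory is checked. A partial-credit score such as `step_completion` is one way to quantify the kind of execution-reliability gap the abstract describes for long, multi-step workflows.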
