AI Agents for Inventory Control: Human-LLM-OR Complementarity
Jackie Baek, Yaopeng Fu, Will Ma, Tianyi Peng
TL;DR
The paper investigates how operations research (OR) algorithms, large language model (LLM) agents, and humans can complement each other in multi-period inventory control. It introduces InventoryBench, a benchmark of 1,320 demand scenarios (synthetic and real-world) designed to stress-test decision rules under demand shifts, seasonality, and uncertain lead times, and shows that OR-LLM hybrids outperform either component alone. In a controlled classroom experiment, human-AI teams achieve higher average profits than humans or AI agents alone. The paper formalizes complementarity at both the population and individual levels and derives a distribution-free lower bound on the share of individuals who benefit from AI collaboration, estimated to be at least 20.3%. It also releases two resources: InventoryBench, with a public leaderboard, and an open-source web-based Inventory Game for teaching and research. Overall, the results indicate that OR provides precise base-stock control, LLMs contribute robust pattern recognition and context-based forecasting, and humans offer strategic guidance and correction, and that combining the three yields superior performance in inventory management settings.
Abstract
Inventory control is a fundamental operations problem in which ordering decisions are traditionally guided by theoretically grounded operations research (OR) algorithms. However, such algorithms often rely on rigid modeling assumptions and can perform poorly when demand distributions shift or relevant contextual information is unavailable. Recent advances in large language models (LLMs) have generated interest in AI agents that can reason flexibly and incorporate rich contextual signals, but it remains unclear how best to incorporate LLM-based methods into traditional decision-making pipelines. We study how OR algorithms, LLMs, and humans can interact and complement each other in a multi-period inventory control setting. We construct InventoryBench, a benchmark of over 1,000 inventory instances spanning both synthetic and real-world demand data, designed to stress-test decision rules under demand shifts, seasonality, and uncertain lead times. Through this benchmark, we find that OR-augmented LLM methods outperform either method in isolation, suggesting that these methods are complementary rather than substitutes. We further investigate the role of humans through a controlled classroom experiment that embeds LLM recommendations into a human-in-the-loop decision pipeline. Contrary to prior findings that human-AI collaboration can degrade performance, we show that, on average, human-AI teams achieve higher profits than either humans or AI agents operating alone. Beyond this population-level finding, we formalize an individual-level complementarity effect and derive a distribution-free lower bound on the fraction of individuals who benefit from AI collaboration; empirically, we find this fraction to be substantial.
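For readers unfamiliar with the base-stock control named in the TL;DR, the following is a minimal textbook sketch of an order-up-to rule, not the paper's implementation. It assumes i.i.d. normally distributed demand, a known constant lead time, and per-unit underage and overage costs. The function names and parameters are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def base_stock_level(mean_demand, std_demand, lead_time, underage_cost, overage_cost):
    """Order-up-to level S under assumed i.i.d. normal demand (illustrative only).

    S covers demand over the protection interval of lead_time + 1 periods,
    at the service level given by the newsvendor critical ratio.
    """
    critical_ratio = underage_cost / (underage_cost + overage_cost)
    periods = lead_time + 1
    mu = mean_demand * periods              # mean demand over the interval
    sigma = std_demand * periods ** 0.5     # std dev, assuming independent periods
    z = NormalDist().inv_cdf(critical_ratio)
    return mu + z * sigma


def order_quantity(base_stock, on_hand, on_order):
    """Order enough to raise the inventory position to the base-stock level."""
    inventory_position = on_hand + on_order
    return max(0.0, base_stock - inventory_position)


if __name__ == "__main__":
    S = base_stock_level(mean_demand=10, std_demand=3, lead_time=2,
                         underage_cost=4, overage_cost=1)
    print(f"base-stock level: {S:.1f}")
    print(f"order quantity:   {order_quantity(S, on_hand=12, on_order=10):.1f}")
```

A rule of this kind is precise when its demand model holds, and it is exactly these assumptions (stationary demand, known lead time) that the abstract describes as brittle under demand shifts and uncertain lead times.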
