AI Agents for Inventory Control: Human-LLM-OR Complementarity

Jackie Baek, Yaopeng Fu, Will Ma, Tianyi Peng

TL;DR

The paper investigates how operations research (OR) algorithms, large language model (LLM) agents, and humans can complement each other in multi-period inventory control. It introduces InventoryBench, a benchmark of 1,320 demand scenarios (synthetic and real) that stress-tests decision rules under demand shifts, seasonality, and uncertain lead times, and demonstrates that OR–LLM hybrids outperform each component alone. Through a controlled classroom experiment, it shows that human–AI teams achieve higher profits on average than humans or AI alone. It also formalizes population- and individual-level complementarity and provides a distribution-free lower bound on the share of individuals who benefit from AI collaboration (estimated to be at least 20.3%). The work offers two resources, InventoryBench with a public leaderboard and an open-source web-based Inventory Game for teaching and research, along with a theoretical framework for evaluating complementarity in human–AI decision-making. Overall, the results indicate that OR provides precise base-stock control, LLMs contribute robust pattern recognition and context-based forecasting, and humans offer strategic guidance and correction; combining the three yields the best performance in the inventory settings studied.

Abstract

Inventory control is a fundamental operations problem in which ordering decisions are traditionally guided by theoretically grounded operations research (OR) algorithms. However, such algorithms often rely on rigid modeling assumptions and can perform poorly when demand distributions shift or relevant contextual information is unavailable. Recent advances in large language models (LLMs) have generated interest in AI agents that can reason flexibly and incorporate rich contextual signals, but it remains unclear how best to incorporate LLM-based methods into traditional decision-making pipelines. We study how OR algorithms, LLMs, and humans can interact and complement each other in a multi-period inventory control setting. We construct InventoryBench, a benchmark of over 1,000 inventory instances spanning both synthetic and real-world demand data, designed to stress-test decision rules under demand shifts, seasonality, and uncertain lead times. Through this benchmark, we find that OR-augmented LLM methods outperform either method in isolation, suggesting that these methods are complementary rather than substitutes. We further investigate the role of humans through a controlled classroom experiment that embeds LLM recommendations into a human-in-the-loop decision pipeline. Contrary to prior findings that human-AI collaboration can degrade performance, we show that, on average, human-AI teams achieve higher profits than either humans or AI agents operating alone. Beyond this population-level finding, we formalize an individual-level complementarity effect and derive a distribution-free lower bound on the fraction of individuals who benefit from AI collaboration; empirically, we find this fraction to be substantial.
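
As a point of reference for the OR side of the comparison, the TL;DR credits OR methods with base-stock control. The following is a minimal sketch of a base-stock (order-up-to) rule inside a multi-period simulation. It is not the paper's implementation: the lost-sales assumption, the cost parameters (`price`, `cost`, `holding`), and all function names are illustrative assumptions.

```python
from collections import deque


def base_stock_order(on_hand, pipeline, base_stock_level):
    """Order enough to raise the inventory position to the base-stock level."""
    inventory_position = on_hand + sum(pipeline)  # on-hand plus outstanding orders
    return max(0, base_stock_level - inventory_position)


def simulate(demands, base_stock_level, lead_time=1,
             price=5.0, cost=3.0, holding=1.0):
    """Simulate a lost-sales system (an assumption); return (profit, orders)."""
    on_hand = 0
    pipeline = deque([0] * lead_time)  # pipeline[0] is the next order to arrive
    profit, orders = 0.0, []
    for demand in demands:
        order = base_stock_order(on_hand, pipeline, base_stock_level)
        orders.append(order)
        pipeline.append(order)
        on_hand += pipeline.popleft()      # oldest outstanding order arrives
        sales = min(on_hand, demand)       # unmet demand is lost
        on_hand -= sales
        profit += price * sales - cost * order - holding * on_hand
    return profit, orders


profit, orders = simulate([4, 6, 5], base_stock_level=10, lead_time=0)
print(orders)  # [10, 4, 6]
```

A fixed base-stock level works well when demand is stationary. The settings the abstract highlights (demand shifts, seasonality, uncertain lead times) are the ones where a fixed level can become stale.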

Paper Structure (83 sections, 2 theorems, 17 equations, 14 figures, 14 tables)

This paper contains 83 sections, 2 theorems, 17 equations, 14 figures, and 14 tables.

Key Result

Theorem 1

Let $F_P$ and $F_Q$ denote the CDFs of $P$ and $Q$, respectively. Then for any $\delta \geq 0$, Moreover, this bound is tight: for any marginal distributions $P$ and $Q$, there exists a joint distribution of $(a,b)$ with these marginals that achieves equality.
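
The displayed inequality of Theorem 1 does not appear in this excerpt, so the sketch below does not attempt to reproduce it. Instead it illustrates a classical distribution-free bound of the same flavor, which may differ from the paper's exact statement. Assume, hypothetically, that $a \sim P$ is an individual's outcome without AI and $b \sim Q$ is the outcome with AI, and that only the two marginals are observed. Since the event $\{a \le x\} \cap \{b > x + \delta\}$ implies $b - a > \delta$, the Fréchet inequality gives $\Pr(b - a > \delta) \ge \sup_x \left[ F_P(x) - F_Q(x + \delta) \right]$ for any joint distribution. The function name and the plug-in use of empirical CDFs are assumptions made for illustration.

```python
from bisect import bisect_right


def lower_bound_benefit(a_samples, b_samples, delta=0.0):
    """Lower-bound Pr(b - a > delta) using only the two marginal samples.

    Plug-in estimate of sup_x [F_P(x) - F_Q(x + delta)] with empirical CDFs.
    The supremum is attained at a sample point of `a`, because between jumps
    of F_P the subtracted term F_Q(x + delta) can only increase.
    """
    a_sorted = sorted(a_samples)
    b_sorted = sorted(b_samples)
    n_a, n_b = len(a_sorted), len(b_sorted)
    best = 0.0  # the bound is never below zero
    for x in a_sorted:
        f_p = bisect_right(a_sorted, x) / n_a          # empirical F_P(x)
        f_q = bisect_right(b_sorted, x + delta) / n_b  # empirical F_Q(x + delta)
        best = max(best, f_p - f_q)
    return best


print(lower_bound_benefit([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

The bound uses no information about how $a$ and $b$ are paired, which is why it applies when each individual is observed in only one condition.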

Figures (14)

  • Figure 1: Decision panels for the three collaboration modes. Mode A: the participant sees the OR recommendation and enters a final order. Mode B: the participant sees the OR-augmented LLM recommendation and reasoning before entering a final order. Mode C: the AI makes ordering decisions autonomously; the participant provides optional strategic guidance at scheduled pauses (every 4 periods).
  • Figure 2: The complementary interaction pattern of OR algorithms, LLM agents, and humans.
  • Figure 3: Architecture of the OR$\to$LLM agent. Each period involves a single LLM call with a fixed system prompt and a period-specific user message, where the "user" here need not involve a human. The agent outputs an order quantity with rationale and optionally updates carry-over insights, which persist to the next period's input.
  • Figure 4: Gemini 3 Flash: overall normalized reward (mean $\pm$ 95% CI) across all 1,320 instances, by method. OR$\to$LLM achieves the highest mean (0.538), with the other hybrid method (LLM$\to$OR) coming second.
  • Figure 5: Gemini 3 Flash: normalized reward by lead time setting (440 instances per setting).
  • ...and 9 more figures
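
The per-period loop described in the Figure 3 caption (one LLM call per period, a fixed system prompt, a period-specific user message, and optional carry-over insights) can be sketched as follows. This is an illustrative skeleton only: the prompt text, the message format, the `Decision` fields, and the idea of embedding the OR recommendation in the user message are assumptions, and `stub_llm` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical fixed system prompt; the paper's actual prompt is not shown here.
SYSTEM_PROMPT = "You manage inventory. Reply with an order quantity and a rationale."


@dataclass
class Decision:
    order: int
    rationale: str
    insights: Optional[str] = None  # None means "keep the previous insights"


def build_user_message(period, on_hand, or_recommendation, insights):
    """Assemble the period-specific message (format is an assumption)."""
    return (f"Period {period}. On hand: {on_hand}. "
            f"OR recommendation: {or_recommendation}. "
            f"Carry-over insights: {insights or 'none'}.")


def run_agent(llm, on_hand_by_period, or_policy):
    """Run one LLM call per period, persisting insights across periods."""
    insights = ""
    decisions = []
    for period, on_hand in enumerate(on_hand_by_period, start=1):
        recommendation = or_policy(on_hand)
        message = build_user_message(period, on_hand, recommendation, insights)
        decision = llm(SYSTEM_PROMPT, message)  # exactly one call per period
        if decision.insights is not None:       # the update is optional
            insights = decision.insights
        decisions.append(decision)
    return decisions, insights


def stub_llm(system_prompt, user_message):
    """Stand-in for a real model call; it records which period it saw."""
    first_sentence = user_message.split(".")[0]  # e.g. "Period 1"
    return Decision(order=0, rationale="stub", insights=f"handled {first_sentence}")


decisions, final_insights = run_agent(stub_llm, [5, 2], lambda h: max(0, 10 - h))
print(final_insights)  # handled Period 2
```

Keeping the insights as a plain string that the model may overwrite mirrors the caption's description of state that persists to the next period's input, without committing to any particular memory format.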

Theorems & Definitions (8)

  • Theorem 1
  • Proof sketch
  • Corollary 1
  • Example 1: LLM detects lost orders
  • Example 2: LLM overfits
  • Example 3: LLM detects synthetic demand shift
  • Example 4: LLM uses world knowledge
  • Proof