VC Theory for Inventory Policies

Yaqi Xie, Will Ma, Linwei Xin

TL;DR

This paper develops a framework that combines trajectory-based reinforcement learning (RL) with classical inventory-structure regularization, so that base-stock and $(s,S)$ policies can be learned under an arbitrary distribution over demand sequences. Writing the policy classes as $\Pi_S$ (stationary base-stock), $\Pi_{(s,S)}$, and $\Pi_{(S^t)}$ (non-stationary base-stock), and bounding their complexity with tools from VC theory (Rademacher complexity, Pseudo-dimension, and Pseudo$_\gamma$-dimension), the authors derive generalization guarantees that are horizon-free or nearly so, and quantify the resulting sample complexities. The generalization error does not grow with the horizon $T$ for $\Pi_S$ and $\Pi_{(S^t)}$, while $\Pi_{(s,S)}$ incurs a mild $\log T$ dependence; lower bounds confirm that these results are tight in several regimes. The authors also compare empirical risk minimization (ERM) with PERM under independent versus correlated demands, showing data-efficiency benefits when temporal structure is preserved and stronger robustness to dependence when using $\Pi$-constrained ERM. Together, the results offer practical guidance on policy class selection and data requirements for data-driven inventory management, supported by numerical experiments.
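
To make the three policy classes concrete, the following is a minimal sketch, not the authors' implementation. It assumes zero lead time, full backlogging, a linear holding cost `h` and backlog cost `b`, a fixed ordering cost `K`, and the convention that an $(s,S)$ policy orders when inventory is at or below $s$. All function names and default cost values are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence

# A policy maps (period index t, inventory x before ordering)
# to an order-up-to level y >= x.
Policy = Callable[[int, float], float]


def base_stock(S: float) -> Policy:
    """Pi_S: raise inventory to S whenever it is below S."""
    return lambda t, x: max(x, S)


def s_S(s: float, S: float) -> Policy:
    """Pi_(s,S): order up to S only when inventory is at or below s (assumed convention)."""
    if s > S:
        raise ValueError("require s <= S")
    return lambda t, x: S if x <= s else x


def nonstationary_base_stock(levels: Sequence[float]) -> Policy:
    """Pi_(S^t): one base-stock level per period."""
    return lambda t, x: max(x, levels[t])


def total_cost(policy: Policy, demands: Sequence[float],
               h: float = 1.0, b: float = 2.0, K: float = 0.0) -> float:
    """Total cost of a policy on one demand trajectory (illustrative cost model)."""
    x, cost = 0.0, 0.0  # start with empty inventory
    for t, d in enumerate(demands):
        y = policy(t, x)
        if y > x:
            cost += K  # fixed cost is paid only when an order is placed
        x = y - d  # unmet demand is backlogged (negative inventory)
        cost += h * max(x, 0.0) + b * max(-x, 0.0)
    return cost


demands = [3.0, 5.0, 2.0]
print(total_cost(base_stock(4.0), demands))
print(total_cost(s_S(1.0, 4.0), demands, K=3.0))
print(total_cost(nonstationary_base_stock(demands), demands))
```

In this sketch the non-stationary class can match each period's demand exactly on a known trajectory, which illustrates why it is the richest of the three classes and why its horizon-free generalization bound is notable.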

Abstract

There has been growing interest in applying reinforcement learning (RL) to inventory management, either by optimizing over temporal transitions or by learning directly from full historical demand trajectories. This contrasts sharply with classical data-driven approaches, which first estimate demand distributions from past data and then compute well-structured optimal policies via dynamic programming. This paper considers a hybrid approach that combines trajectory-based RL with policy regularization imposing base-stock and $(s, S)$ structures. We provide generalization guarantees for this combined approach for several well-known classes in a $T$-period dynamic inventory model, using tools from the celebrated Vapnik-Chervonenkis (VC) theory, such as the Pseudo-dimension and Fat-shattering dimension. Our results have implications for regret against the best-in-class policies, and allow for an arbitrary distribution over demand sequences, with no assumptions such as independence across time. Surprisingly, we prove that the class of policies defined by $T$ non-stationary base-stock levels exhibits a generalization error that does not grow with $T$, whereas the two-parameter $(s, S)$ policy class has a generalization error growing logarithmically with $T$. Overall, our analysis leverages specific inventory structures within the learning theory framework, and improves sample complexity guarantees even compared to existing results assuming independent demands.
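
The trajectory-based approach with a structural restriction can be illustrated by a small empirical risk minimization over stationary base-stock levels. This is a sketch under stated assumptions rather than the paper's procedure: zero lead time, full backlogging, per-period averaged holding/backlog costs with hypothetical rates `h` and `b`, a grid search in place of exact optimization, and a made-up correlated demand process (`correlated_trajectory`) that serves only to show that independence across time is not needed to run the method.

```python
import random


def trajectory_loss(S, demands, h=1.0, b=2.0):
    """Average per-period cost of base-stock level S on one demand trajectory."""
    x, cost = 0.0, 0.0
    for d in demands:
        y = max(x, S)  # order up to S if below it
        x = y - d  # backlog unmet demand
        cost += h * max(x, 0.0) + b * max(-x, 0.0)
    return cost / len(demands)


def empirical_loss(S, sample):
    """Mean loss of S over a sample of whole trajectories."""
    return sum(trajectory_loss(S, traj) for traj in sample) / len(sample)


def erm_base_stock(sample, grid):
    """Grid-search ERM restricted to the stationary base-stock class."""
    return min(grid, key=lambda S: empirical_loss(S, sample))


def correlated_trajectory(T, U, rng):
    """Hypothetical demand process: a random walk clipped to [0, U]."""
    d = rng.uniform(0.0, U)
    out = []
    for _ in range(T):
        d = min(U, max(0.0, d + rng.uniform(-1.0, 1.0)))
        out.append(d)
    return out


rng = random.Random(0)
T, U, N = 20, 10.0, 200
train = [correlated_trajectory(T, U, rng) for _ in range(N)]
test = [correlated_trajectory(T, U, rng) for _ in range(2000)]
grid = [U * i / 50 for i in range(51)]

S_hat = erm_base_stock(train, grid)
# The generalization error is the gap between out-of-sample and in-sample loss.
gap = empirical_loss(S_hat, test) - empirical_loss(S_hat, train)
print(f"S_hat = {S_hat:.2f}, estimated generalization gap = {gap:.4f}")
```

Each training example here is an entire length-$T$ trajectory, so temporal dependence within a trajectory is preserved; rerunning with larger `T` gives an informal way to observe how the gap behaves as the horizon grows.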

Paper Structure

This paper contains 34 sections, 12 theorems, 60 equations, 10 figures, 2 tables.

Key Result

Proposition 1

For any distribution $\mathcal{D}\in \Delta([0, U]^{T+L})$, sample size $N\in\mathbb{Z}_{>0}$, and function class $\mathcal{F}(\Pi) = \left\{f(\pi, \cdot): \pi\in \Pi\right\}$, the following inequality holds:

Figures (10)

  • Figure 1: The expected EEs relative to the optimal loss, and the OOS loss ratios, of $\Pi_S$, $\Pi_{(s, S)}$ and $\Pi_{(S^t)}$ policies when $K=0$. Note that for $T=1$, the OOS loss ratio of $\Pi_S$ and $\Pi_{(S^t)}$ (which are equivalent) is 1.029.
  • Figure 2: The OOS loss ratios of $\Pi_{(s,S)}$ and $\Pi_{S}$ when $K>0$. Subfigures (b), (c), and (d) present the results for different values of $T$, $K$, and $\sigma_0$, respectively, compared with (a).
  • Figure 3: The OOS loss ratios of $\Pi_{(S^t)}$ and $\Pi_{S}$ when $K=0$. Subfigures (b), (c), and (d) present the results for different values of $T$, $\mathsf{Nonst}$, and $\sigma_0$, respectively, compared with (a).
  • Figure 4: The OOS ratios of the ERM approaches compared with the PERM approach under independent demands.
  • Figure 5: The OOS ratios of the ERM approaches compared with the PERM approach under correlated demands.
  • ...and 5 more figures

Theorems & Definitions (22)

  • Definition 1: Rademacher complexity
  • Proposition 1
  • Definition 2: Pseudo-dimension
  • Proposition 2
  • Definition 3: Pseudo$_\gamma$-dimension
  • Proposition 3
  • Theorem 1: Proved in \ref{pf:thm-base}
  • Corollary 1
  • Remark 1
  • Theorem 2: Proved in \ref{pf:thm-ss-ub}
  • ...and 12 more
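
For orientation, the textbook forms of the three complexity measures named in Definitions 1 to 3 are sketched here. The paper's own statements may differ in normalization or notation, and identifying the Pseudo$_\gamma$-dimension with the fat-shattering dimension at scale $\gamma$ is an assumption based on the abstract. For a class $\mathcal{F}$ of real-valued functions and points $z_1, \dots, z_N$, the empirical Rademacher complexity is

$$\hat{\mathfrak{R}}_N(\mathcal{F}) = \mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}} \frac{1}{N}\sum_{i=1}^{N} \sigma_i f(z_i)\right],$$

where $\sigma_1, \dots, \sigma_N$ are independent and uniform on $\{-1, +1\}$. The Pseudo-dimension of $\mathcal{F}$ is the largest $n$ for which there exist points $z_1, \dots, z_n$ and thresholds $r_1, \dots, r_n \in \mathbb{R}$ such that

$$\forall\, b \in \{0,1\}^n \;\; \exists\, f \in \mathcal{F}: \quad f(z_i) > r_i \iff b_i = 1 \quad \text{for all } i = 1, \dots, n.$$

The fat-shattering dimension at scale $\gamma > 0$ strengthens this by requiring a margin around each threshold:

$$\forall\, b \in \{0,1\}^n \;\; \exists\, f \in \mathcal{F}: \quad f(z_i) \ge r_i + \gamma \ \text{ if } b_i = 1, \qquad f(z_i) \le r_i - \gamma \ \text{ if } b_i = 0.$$

Because the margin condition is harder to satisfy, the fat-shattering dimension at any scale $\gamma > 0$ is at most the Pseudo-dimension.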