VC Theory for Inventory Policies
Yaqi Xie, Will Ma, Linwei Xin
TL;DR
VC Theory for Inventory Policies develops a framework that blends trajectory-based RL with classical inventory-structure regularization, enabling learning of base-stock and $(s,S)$ policies under an arbitrary distribution over demand sequences. By encoding the policy classes as $\Pi_S$ (a single base-stock level), $\Pi_{(s,S)}$, and $\Pi_{(S^t)}$ ($T$ non-stationary base-stock levels) and applying complexity measures from VC theory (Rademacher complexity, Pseudo-dimension, and Pseudo$_\gamma$-dimension), the authors derive horizon-free or near horizon-free generalization guarantees and quantify sample complexities. Key findings show that the generalization error does not grow with the horizon $T$ for $S$ and $(S^t)$ policies, while $(s,S)$ policies incur a mild log-$T$ dependence; lower bounds confirm the tightness of these results in several regimes. The authors also compare ERM with PERM under independent versus correlated demands, showing data-efficiency benefits when temporal structure is preserved, and stronger robustness to dependence when using $\Pi$-constrained ERM. Together, the results offer practical guidance on policy class selection and data requirements for data-driven inventory management, backed by numerical experiments that illustrate the theory.
Abstract
There has been growing interest in applying reinforcement learning (RL) to inventory management, either by optimizing over temporal transitions or by learning directly from full historical demand trajectories. This contrasts sharply with classical data-driven approaches, which first estimate demand distributions from past data and then compute well-structured optimal policies via dynamic programming. This paper considers a hybrid approach that combines trajectory-based RL with policy regularization imposing base-stock and $(s, S)$ structures. We provide generalization guarantees for this combined approach for several well-known policy classes in a $T$-period dynamic inventory model, using tools from the celebrated Vapnik-Chervonenkis (VC) theory, such as the Pseudo-dimension and Fat-shattering dimension. Our results have implications for regret against the best-in-class policies, and allow for an arbitrary distribution over demand sequences, with no assumptions such as independence across time. Surprisingly, we prove that the class of policies defined by $T$ non-stationary base-stock levels exhibits a generalization error that does not grow with $T$, whereas the two-parameter $(s, S)$ policy class has a generalization error growing logarithmically with $T$. Overall, our analysis leverages specific inventory structures within the learning theory framework, and improves sample complexity guarantees even compared to existing results assuming independent demands.
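To make the hybrid approach concrete, the following is a minimal sketch of empirical risk minimization over a structured policy class, evaluated on whole demand trajectories. It is not the authors' implementation. The cost model (zero lead time, backlogged demand, holding cost `h`, backlog cost `b`, fixed ordering cost `K`, cost averaged over the horizon), the reorder convention `inventory <= s`, and all function names are assumptions introduced here for illustration.

```python
import random


def base_stock_order(inventory, S):
    """Base-stock rule: order up to S whenever inventory is below S."""
    return max(S - inventory, 0.0)


def s_S_order(inventory, s, S):
    """(s, S) rule: order up to S only once inventory is at or below s.

    The `<=` convention is an assumption; some texts use a strict inequality.
    """
    return float(S - inventory) if inventory <= s else 0.0


def trajectory_cost(demands, order_rule, h=1.0, b=4.0, K=0.0):
    """Average per-period cost of one policy on one demand trajectory.

    Assumed model (not from the paper): zero lead time, unmet demand is
    backlogged, holding cost h, backlog cost b, fixed ordering cost K.
    """
    inventory, total = 0.0, 0.0
    for d in demands:
        q = order_rule(inventory)
        if q > 0:
            total += K
        inventory += q - d
        total += h * max(inventory, 0.0) + b * max(-inventory, 0.0)
    return total / len(demands)


def erm_base_stock(trajectories, candidates, **cost_kwargs):
    """Pick the candidate level S with the lowest average cost over all
    sampled trajectories. Each trajectory is used whole, so no independence
    across periods is assumed."""
    def empirical_loss(S):
        costs = [
            trajectory_cost(tr, lambda x: base_stock_order(x, S), **cost_kwargs)
            for tr in trajectories
        ]
        return sum(costs) / len(costs)

    return min(candidates, key=empirical_loss)


if __name__ == "__main__":
    rng = random.Random(0)

    def correlated_trajectory(T):
        # Hypothetical autoregressive demand, so periods are not independent.
        d, out = 5.0, []
        for _ in range(T):
            d = max(0.0, 0.7 * d + rng.gauss(3.0, 1.0))
            out.append(d)
        return out

    samples = [correlated_trajectory(20) for _ in range(50)]
    print("ERM base-stock level:", erm_base_stock(samples, range(0, 31)))
```

The grid search over `candidates` is only the simplest way to carry out the minimization. The generalization guarantees summarized above concern how far the empirical loss of the selected policy can be from its expected loss, not the search procedure used to find it.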
