DeepStock: Reinforcement Learning with Policy Regularizations for Inventory Management

Yaqi Xie; Xinru Hao; Jiaxi Liu; Will Ma; Linwei Xin; Lei Cao; Yidong Zhang

Abstract

Deep Reinforcement Learning (DRL) provides a general-purpose methodology for training inventory policies that can leverage big data and compute. However, off-the-shelf implementations of DRL have seen mixed success, often plagued by high sensitivity to the hyperparameters used during training. In this paper, we show that by imposing policy regularizations, grounded in classical inventory concepts such as "Base Stock", we can significantly accelerate hyperparameter tuning and improve the final performance of several DRL methods. We report details from a 100% deployment of DRL with policy regularizations on Alibaba's e-commerce platform, Tmall. We also include extensive synthetic experiments, which show that policy regularizations reshape the narrative on which DRL method is best for inventory management.
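For readers unfamiliar with the "Base Stock" concept mentioned above, the following is a minimal sketch of the classical order-up-to rule only. It is not the paper's regularization or training procedure, which this summary does not specify; the function name `base_stock_order` and the numeric values are hypothetical and chosen purely for illustration.

```python
def base_stock_order(inventory_position: float, base_stock_level: float) -> float:
    """Classical base-stock (order-up-to) rule.

    Orders just enough to raise the inventory position to the
    base-stock level, and orders nothing if it is already at or above it.
    Illustrative helper only; not taken from the paper.
    """
    return max(0.0, base_stock_level - inventory_position)


# Hypothetical example: a base-stock level of 100 units and three
# different inventory positions.
example_orders = [base_stock_order(ip, 100.0) for ip in (20.0, 100.0, 130.0)]
```

Under such a rule the policy is described by a single target level rather than a free-form order quantity, which is one way a classical concept can constrain the shape of a learned policy.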

Paper Structure (30 sections, 5 equations, 5 figures, 9 tables)

Introduction
Description of DRL and Policy Regularizations
Main Results
Paper outline.
State of the Literature on RL for Inventory
Policy regularizations.
DS vs. traditional DRL for inventory.
Large-scale deployments of DRL for inventory.
Inventory Model, Metrics, and Policies
Inventory dynamics.
Performance metrics.
Contexts and policies.
Discussion of model assumptions.
Training using Deep Reinforcement Learning (DRL)
Overview of DRL Training Methods
...and 15 more sections

Figures (5)

Figure 1: Validation and Testing Loss Gaps for the 6 combinations of DRL Method and Policy Regularization.
Figure 2: Validation Loss Gaps for the top-5 hyperparameter configurations, shown for each combination of DRL Method and Policy Regularization, in Setting 1.
Figure 3: Critic loss and states during the first 10,000 timesteps in Setting 1 under a hyperparameter configuration tuned for DDPG with None regularization. Training starts after 500 transitions have been collected via random actions. Each dot in the two bottom subfigures corresponds to a single visit to state $s=(I_t, x_t)$.
Figure 4: Testing and Validation Loss Gaps for DS with Base Regularization in Setting 4.
Figure 5: Evolution of average Turnover Time for international SKUs during July-August, in 2024 and 2025. All numbers are normalized relative to the maximum average Turnover Time encountered in either year.
