DeepStock: Reinforcement Learning with Policy Regularizations for Inventory Management

Yaqi Xie, Xinru Hao, Jiaxi Liu, Will Ma, Linwei Xin, Lei Cao, Yidong Zhang

Abstract

Deep Reinforcement Learning (DRL) provides a general-purpose methodology for training inventory policies that can leverage big data and compute. However, off-the-shelf implementations of DRL have seen mixed success, often plagued by high sensitivity to the hyperparameters used during training. In this paper, we show that by imposing policy regularizations, grounded in classical inventory concepts such as "Base Stock", we can significantly accelerate hyperparameter tuning and improve the final performance of several DRL methods. We report details from a 100% deployment of DRL with policy regularizations on Alibaba's e-commerce platform, Tmall. We also include extensive synthetic experiments, which show that policy regularizations reshape the narrative on what is the best DRL method for inventory management.

Paper Structure (30 sections, 5 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Validation and Testing Loss Gaps for the 6 combinations of DRL Method and Policy Regularization.
  • Figure 2: Validation Loss Gaps for the top-5 hyperparameter configurations, shown for each combination of DRL Method and Policy Regularization, in Setting 1.
  • Figure 3: Critic loss and states during the first 10,000 timesteps in Setting 1 under a hyperparameter configuration tuned for DDPGNone. The training starts after 500 transitions are collected via random actions. Each dot in the two bottom subfigures corresponds to a single visit to state $s=(I_t, x_t)$.
  • Figure 4: Testing and Validation Loss Gaps for DS with Base Regularization in Setting 4.
  • Figure 5: Evolution of average Turnover Time for international SKUs during July–August of 2024 and 2025. All numbers are normalized relative to the maximum average Turnover Time encountered in either year.