Classical and Deep Reinforcement Learning Inventory Control Policies for Pharmaceutical Supply Chains with Perishability and Non-Stationarity

Francesco Stranieri; Chaaben Kouki; Willem van Jaarsveld; Fabio Stella

TL;DR

The paper tackles pharmaceutical inventory control under real-world constraints: perishability, yield uncertainty, and non-stationary demand. In collaboration with Bristol-Myers Squibb (BMS), it develops a realistic case study and benchmarks three policies, order-up-to (OUT), projected inventory level (PIL), and proximal policy optimization (PPO), against a baseline based on human expertise. Key contributions include bounds-based procedures for selecting OUT and PIL parameters, a novel feature design for PPO that combines demand forecasts, life-cycle information, and projected inventory levels $\mathbb{E}[\mathbf{x}_{t+L}]$ to handle non-stationarity, and numerical experiments across two demand scenarios. The findings show that DRL (PPO) has potential in complex, variable environments but does not universally outperform classical policies: PIL offers robust, consistent performance; OUT can be competitive in some regimes but is fragile under high lost-sales costs; and the human baseline remains strong in service and risk management. Practically, the work suggests using a portfolio of policies, potentially hybridized with forecasting and life-cycle awareness, rather than relying on any single policy class.

Abstract

We study inventory control policies for pharmaceutical supply chains, addressing challenges such as perishability, yield uncertainty, and non-stationary demand, combined with batching constraints, lead times, and lost sales. Collaborating with Bristol-Myers Squibb (BMS), we develop a realistic case study incorporating these factors and benchmark three policies--order-up-to (OUT), projected inventory level (PIL), and deep reinforcement learning (DRL) using the proximal policy optimization (PPO) algorithm--against a BMS baseline based on human expertise. We derive and validate bounds-based procedures for optimizing OUT and PIL policy parameters and propose a methodology for estimating projected inventory levels, which are also integrated into the DRL policy with demand forecasts to improve decision-making under non-stationarity. Compared to a human-driven policy, which avoids lost sales through higher holding costs, all three implemented policies achieve lower average costs but exhibit greater cost variability. While PIL demonstrates robust and consistent performance, OUT struggles under high lost sales costs, and PPO excels in complex and variable scenarios but requires significant computational effort. The findings suggest that while DRL shows potential, it does not outperform classical policies in all numerical experiments, highlighting 1) the need to integrate diverse policies to manage pharmaceutical challenges effectively, based on the current state-of-the-art, and 2) that practical problems in this domain seem to lack a single policy class that yields universally acceptable performance.
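To make the order-up-to idea concrete, the following is a minimal sketch of an OUT policy in a toy perishable, lost-sales setting. It is not the paper's model: the order of events, first-in-first-out issuing, the default cost values, and the names `order_up_to` and `simulate_out` are all assumptions chosen for illustration, and yield uncertainty and batching constraints are left out.

```python
from collections import deque


def order_up_to(S, on_hand, pipeline):
    """Order quantity that raises the inventory position to S (never negative)."""
    position = sum(on_hand) + sum(pipeline)
    return max(0, S - position)


def simulate_out(S, demands, lead_time=1, shelf_life=3, h=1.0, b=10.0, w=2.0):
    """Total cost of a fixed order-up-to level S over a demand sequence.

    Assumed (hypothetical) order of events per period:
    order, receive, serve demand FIFO, expire, pay costs.
    """
    # on_hand[i] holds units with i + 1 periods of life left (index 0 expires first).
    on_hand = [0] * shelf_life
    pipeline = deque([0] * lead_time)  # orders in transit, oldest first
    total = 0.0
    for d in demands:
        # 1) Place an order based on the current inventory position.
        pipeline.append(order_up_to(S, on_hand, pipeline))
        # 2) The oldest outstanding order arrives as fresh stock.
        on_hand[-1] += pipeline.popleft()
        # 3) Serve demand from the oldest stock first; unmet demand is lost.
        remaining = d
        for i in range(shelf_life):
            used = min(on_hand[i], remaining)
            on_hand[i] -= used
            remaining -= used
        lost = remaining
        # 4) Stock in its last period of life expires; the rest ages by one period.
        wasted = on_hand[0]
        on_hand = on_hand[1:] + [0]
        # 5) Holding cost h, lost-sales cost b, waste cost w (all per unit).
        total += h * sum(on_hand) + b * lost + w * wasted
    return total
```

For example, `simulate_out(10, [4, 4, 4])` incurs lost sales in the first period, because the first order only arrives after one period of lead time, and holding costs afterwards. A PIL policy would instead base the order on the inventory level projected at the end of the lead time, which requires accounting for expected expirations and demand during that interval.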

Paper Structure (29 sections, 2 theorems, 50 equations, 6 figures, 4 tables)

Introduction
Related Work
Classical Policies
Deep Reinforcement Learning
Case Study Description
Order of Events
Cost Transformation
Dynamic Programming Formulation
Inventory Policies
OUT Policy
OUT Policy Lower Bound
OUT Policy Upper Bound
BMS Baseline Policy
BMS Baseline Policy Safety Stock
PIL Policy
...and 14 more sections

Key Result

Lemma 1

The total cost defined in the total-cost equation can be expressed in terms of transformed cost parameters, where $h = \hat{h}$, $b = \hat{b} - \hat{c}$, and $w = \hat{w} + \hat{c}$.

Figures (6)

Figure 1: Representation of the supply chain environment.
Figure 2: Order of events in the supply chain environment.
Figure 3: A 95% confidence interval based on 2000 simulated episodes for the demand in the second scenario derived from real-world data over an episode horizon of $T = 240$ timesteps.
Figure 4: Average total cost for the PPO algorithm, with lower (LB), upper (UB), and optimal (OPT) values for the OUT and PIL policies. Demand noise is modeled as $\xi_t \sim \mathcal{N}(0, \bar{d} \times 15\%)$, where $\bar{d} = \max_t{d_t}$. Each row corresponds to a different value of $m = \{2, 3, 4\}$, and each column to a different value of $w = \{2, 4\}$. Each subplot shows the OUT, PIL, and PPO costs for $b = \{10, 50, 100, 1000\}$.
Figure 5: Bar plots representing the average total cost over 2000 simulated episodes, with demand noise modeled as $\xi_t \sim \mathcal{N}(0, \bar{d} \times 15\%)$, where $\bar{d} = \max_t{d_t}$. Each row corresponds to a different value of $m = \{2, 3, 4\}$, and each column to a different value of $w = \{2, 4\}$. In each subplot, the bars show the OUT, PIL, and PPO costs for $b = \{10, 50, 100, 1000\}$.
...and 1 more figures

Theorems & Definitions (6)

Lemma 1
proof
Lemma 2
proof
proof
proof
