Table of Contents
Fetching ...

COOL-MC: Verifying and Explaining RL Policies for Platelet Inventory Management

Dennis Gross

TL;DR

COOL-MC is applied, a tool that combines RL with probabilistic model checking and explainable RL, to verify and explain a trained policy for the MDP on platelet inventory management inspired by Haijema et al.

Abstract

Platelets expire within five days. Blood banks face uncertain daily demand and must balance ordering decisions between costly wastage from overstocking and life-threatening shortages from understocking. Reinforcement learning (RL) can learn effective ordering policies for this Markov decision process (MDP), but the resulting neural policies remain black boxes, hindering trust and adoption in safety-critical domains. We apply COOL-MC, a tool that combines RL with probabilistic model checking and explainable RL, to verify and explain a trained policy for the MDP on platelet inventory management inspired by Haijema et al. By constructing a policy-induced discrete-time Markov chain (which includes only the reachable states under the trained policy to reduce memory usage), we verify PCTL properties and provide feature-level explanations. Results show that the trained policy achieves a 2.9% stockout probability and a 1.1% inventory-full (potential wastage) probability within a 200-step horizon, primarily attends to the age distribution of inventory rather than other features such as day of week or pending orders. Action reachability analysis reveals that the policy employs a diverse replenishment strategy, with most order quantities reached quickly, while several are never selected. Counterfactual analysis shows that replacing medium-large orders with smaller ones leaves both safety probabilities nearly unchanged, indicating that these orders are placed in well-buffered inventory states. This first formal verification and explanation of an RL platelet inventory management policy demonstrates COOL-MC's value for transparent, auditable decision-making in safety-critical healthcare supply chain domains.

COOL-MC: Verifying and Explaining RL Policies for Platelet Inventory Management

TL;DR

COOL-MC is applied, a tool that combines RL with probabilistic model checking and explainable RL, to verify and explain a trained policy for the MDP on platelet inventory management inspired by Haijema et al.

Abstract

Platelets expire within five days. Blood banks face uncertain daily demand and must balance ordering decisions between costly wastage from overstocking and life-threatening shortages from understocking. Reinforcement learning (RL) can learn effective ordering policies for this Markov decision process (MDP), but the resulting neural policies remain black boxes, hindering trust and adoption in safety-critical domains. We apply COOL-MC, a tool that combines RL with probabilistic model checking and explainable RL, to verify and explain a trained policy for the MDP on platelet inventory management inspired by Haijema et al. By constructing a policy-induced discrete-time Markov chain (which includes only the reachable states under the trained policy to reduce memory usage), we verify PCTL properties and provide feature-level explanations. Results show that the trained policy achieves a 2.9% stockout probability and a 1.1% inventory-full (potential wastage) probability within a 200-step horizon, primarily attends to the age distribution of inventory rather than other features such as day of week or pending orders. Action reachability analysis reveals that the policy employs a diverse replenishment strategy, with most order quantities reached quickly, while several are never selected. Counterfactual analysis shows that replacing medium-large orders with smaller ones leaves both safety probabilities nearly unchanged, indicating that these orders are placed in well-buffered inventory states. This first formal verification and explanation of an RL platelet inventory management policy demonstrates COOL-MC's value for transparent, auditable decision-making in safety-critical healthcare supply chain domains.
Paper Structure (40 sections, 6 equations, 7 figures, 1 algorithm)

This paper contains 40 sections, 6 equations, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: Sequential decision-making loop. The agent receives an observation and reward from the environment, selects an action according to its policy, and the environment transitions to a new state.
  • Figure 2: General model checking workflow DBLP:journals/sttt/HenselJKQV22. The system is formally modeled, the requirements are formalized, and both are input to a model checker such as Storm, which verifies the property.
  • Figure 3: Bounded reachability probabilities $P_{=?}(\lozenge^{\leq B}\;\textit{empty})$ and $P_{=?}(\lozenge^{\leq B}\;\textit{full})$ under the trained PPO policy for increasing horizon bounds $B \in \{150, 175, \dots, 400\}$. Both probabilities grow monotonically with the horizon, with stockout risk consistently exceeding inventory-full risk. The average relative ascent per interval is similar for both properties (${\approx}\,11.2\%$ for empty, ${\approx}\,10.2\%$ for full), indicating that neither risk dominates in its sensitivity to the planning horizon.
  • Figure 4: Expected number of steps to the first occurrence of each ordering action under the trained PPO policy, computed via PCTL queries $T_{=?}(\,F\;\texttt{"pr*}\texttt{"}\,)$ on the induced DTMC. The $y$-axis shows $\log_{10}(T)$; hatched bars indicate actions that are never selected ($T = \infty$). Actions with low $T$ dominate the policy's behaviour, while extreme order quantities are either rarely reached or entirely unreachable.
  • Figure 5: Relative change in the probability of reaching an empty inventory state, $P_{=?}(\lozenge^{\leq 200}\;\texttt{"empty"})$, when each feature is individually pruned from the learned PPO policy. Positive values (red) indicate that removing the feature increases the probability of a stockout, revealing features critical for the avoidance of inventory depletion. Notably, the freshest inventory levels $x_4$ and $x_5$ are by far the most influential, suggesting that the policy heavily relies on fresh stock information to prevent shortages.
  • ...and 2 more figures

Theorems & Definitions (4)

  • Definition 1: MDP
  • Definition 2: DTMC
  • Definition 3: Observation
  • Definition 4