Reinforcement Learning in Dynamic Treatment Regimes Needs Critical Reexamination

Zhiyao Luo; Yangchen Pan; Peter Watkinson; Tingting Zhu

TL;DR

This position paper critically evaluates offline reinforcement learning for dynamic treatment regimes in healthcare, arguing that inconsistent evaluation metrics, missing naive and supervised baselines, and diverse MDP formulations undermine claims of RL efficacy. Through a Sepsis-based case study with over 17,000 evaluations, the authors show that RL performance is highly sensitive to reward design and policy evaluation methods, and that simple baselines can outperform RL in some settings. They demonstrate issues with policy evaluation estimators, including potential over-/under-estimation by Doubly Robust methods and the impact of calibration on importance weights. The work advocates a more standardized, rigorous evaluation framework, richer baselines, and careful reward design to ensure reliable, safe deployment of RL for DTRs in clinical practice, and provides code to support reproducibility.
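The sensitivity of importance weights to the estimated behaviour policy can be illustrated with a minimal sketch. It is not the authors' code: the function name, the 3-step trajectory, and all probabilities below are hypothetical, and the target policy is assumed to be uniformly random over 5 actions.

```python
import math


def trajectory_importance_ratio(target_probs, behaviour_probs, eps=1e-8):
    """Product over time steps of pi_e(a_t | s_t) / pi_b(a_t | s_t).

    Computed in log space for numerical stability. `eps` is an assumed
    floor that avoids log(0); it is not taken from the paper.
    """
    log_ratio = 0.0
    for p_e, p_b in zip(target_probs, behaviour_probs):
        log_ratio += math.log(max(p_e, eps)) - math.log(max(p_b, eps))
    return math.exp(log_ratio)


# Hypothetical probabilities of the logged actions under each policy.
target = [0.2, 0.2, 0.2]      # uniform random policy over 5 actions
moderate = [0.4, 0.5, 0.4]    # behaviour estimates away from 0 and 1
extreme = [0.4, 0.5, 0.01]    # one near-zero behaviour estimate

ratio_moderate = trajectory_importance_ratio(target, moderate)  # about 0.1
ratio_extreme = trajectory_importance_ratio(target, extreme)    # about 4.0
print(ratio_moderate, ratio_extreme)
```

A single near-zero behaviour probability inflates the trajectory ratio by a factor of 40 in this toy case, which is why a behaviour model that outputs more extreme probabilities can change importance-sampling-based estimates substantially.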

Abstract

In the rapidly changing healthcare landscape, the implementation of offline reinforcement learning (RL) in dynamic treatment regimes (DTRs) presents a mix of unprecedented opportunities and challenges. This position paper offers a critical examination of the current status of offline RL in the context of DTRs. We argue for a reassessment of applying RL in DTRs, citing concerns such as inconsistent and potentially inconclusive evaluation metrics, the absence of naive and supervised learning baselines, and the diverse choices of RL formulation in existing research. Through a case study with more than 17,000 evaluation experiments using a publicly available Sepsis dataset, we demonstrate that the performance of RL algorithms can vary significantly with changes in evaluation metrics and Markov Decision Process (MDP) formulations. Surprisingly, in some instances RL algorithms are surpassed by random baselines, depending on the policy evaluation method and reward design. This calls for more careful policy evaluation and algorithm development in future DTR work. Additionally, we discuss potential enhancements toward more reliable development of RL-based dynamic treatment regimes and invite further discussion within the community. Code is available at https://github.com/GilesLuo/ReassessDTR.

Paper Structure (46 sections, 18 equations, 23 figures, 46 tables)

Introduction
Background
RL and Offline RL
Dynamic Treatment Regime
Problem Formulation in Offline RL for DTR
Diversity and Inconsistency of Policy Evaluation Methods in RL-DTR
Challenges of Policy Evaluation in RL
Existing Evaluation Methods in RL-DTR
Reward Design Choices
Outcome-Based Reward
Risk-based Reward
ICU Risk-Based Reward
Early Warning Risk-Based Reward
Baselines Comparisons
Experiments
...and 31 more sections

Figures (23)

Figure 1: Number of wins for each policy in the (overall) test set. Wins are calculated based on the mean performance of 5 random seeds. Alt, min, max, random, and weight policies are naive baselines. This notation applies to all the following figures.
Figure 2: Total number of wins across patient subgroups stratified by the rate of change of mortality risk. This figure presents the cumulative performance of each algorithm, measured by the number of wins across 12 stratified subsets derived from the test set. Wins are calculated for each algorithm within each subset across all metrics and then aggregated to reflect overall performance. This approach allows an assessment of average algorithmic efficacy across patient subgroups stratified by changes in mortality risk.
Figure 3: Behavioral and value estimators versus their losses on the test set. The count in each bin is indicated by a colour bar, transitioning from blue to red as the number increases. (a) depicts the behavioral loss (samples with a cross-entropy loss $>$ 90th percentile) versus the inference probability. (b), (c), and (d) show the direct method estimator loss (samples with L1 loss $>$ 90th percentile) on the outcome, SOFA, and NEWS2 reward, respectively.
Figure 4: Comparison of output probability between calibrated and uncalibrated $\hat{\pi}_{\mathcal{D}}$. The plot shows a histogram of output probability and the number of counts in the dataset, with a logarithmic scale on the y-axis, on the training, validation, and test sets, respectively. It is observed that the frequencies of extreme probabilities (i.e., probabilities near 0 and 1) are higher after calibration.
Figure 5: Importance ratio histogram of random policy $>$ 99th percentile. The horizontal axis includes different datasets, where 'All' means the test set and the rest are NEWS2 risk-stratified subsets, indexed in ascending order of NEWS2 change rate. The calibrated model contains more extremely large ratios $>$ 99th percentile. Only ratio outliers (i.e., $>$ 99th percentile) are plotted for visualization convenience. To view the other 14 ratio plots for 5 baseline policies in 3 different reward settings, please see the appendix on ratio plots.
...and 18 more figures
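The win counting described in the captions of Figures 1 and 2 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: it assumes higher scores are better for every metric, that the policy with the highest seed-averaged score wins a metric, and that ties go to the first policy encountered. The function name, metric names, and all scores are hypothetical.

```python
from collections import Counter
from statistics import mean


def count_wins(scores):
    """Count, per policy, the number of metrics it wins.

    `scores` maps metric name -> policy name -> list of per-seed scores.
    A policy wins a metric if its mean over seeds is the highest
    (assumption: higher is better; ties go to the first policy listed).
    """
    wins = Counter()
    for per_policy in scores.values():
        means = {policy: mean(vals) for policy, vals in per_policy.items()}
        best = max(means, key=means.get)
        wins[best] += 1
    return wins


# Hypothetical scores over 5 random seeds for two policies and two metrics.
scores = {
    "metric_a": {
        "rl_agent": [0.6, 0.7, 0.65, 0.7, 0.6],
        "random": [0.5, 0.5, 0.5, 0.5, 0.5],
    },
    "metric_b": {
        "rl_agent": [0.2, 0.3, 0.2, 0.3, 0.2],
        "random": [0.4, 0.4, 0.4, 0.4, 0.4],
    },
}

wins = count_wins(scores)
print(dict(wins))
```

In this toy input each policy wins one metric, so the ranking depends entirely on which metric is reported. Summing such counters over stratified subsets would give an aggregate in the spirit of Figure 2.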
