Reinforcement Learning in Dynamic Treatment Regimes Needs Critical Reexamination
Zhiyao Luo, Yangchen Pan, Peter Watkinson, Tingting Zhu
TL;DR
This position paper critically evaluates offline reinforcement learning for dynamic treatment regimes in healthcare, arguing that inconsistent evaluation metrics, missing naive and supervised baselines, and diverse MDP formulations undermine claims of RL efficacy. Through a case study on a publicly available Sepsis dataset with over 17,000 evaluations, the authors show that RL performance is highly sensitive to reward design and policy evaluation methods, and that simple baselines, including random policies, can outperform RL in some settings. They demonstrate issues with policy evaluation estimators, including potential over- and under-estimation by Doubly Robust methods and the impact of calibration on importance weights. The work advocates a more standardized, rigorous evaluation framework, richer baselines, and careful reward design to support reliable, safe deployment of RL for DTRs in clinical practice, and provides code for reproducibility.
Abstract
In the rapidly changing healthcare landscape, the implementation of offline reinforcement learning (RL) in dynamic treatment regimes (DTRs) presents a mix of unprecedented opportunities and challenges. This position paper offers a critical examination of the current status of offline RL in the context of DTRs. We argue for a reassessment of applying RL in DTRs, citing concerns such as inconsistent and potentially inconclusive evaluation metrics, the absence of naive and supervised learning baselines, and the diverse choice of RL formulation in existing research. Through a case study with more than 17,000 evaluation experiments using a publicly available Sepsis dataset, we demonstrate that the performance of RL algorithms can vary significantly with changes in evaluation metrics and Markov Decision Process (MDP) formulations. Surprisingly, in some instances RL algorithms are surpassed by random baselines, depending on the policy evaluation method and reward design. This calls for more careful policy evaluation and algorithm development in future DTR work. Additionally, we discuss potential enhancements toward more reliable development of RL-based dynamic treatment regimes and invite further discussion within the community. Code is available at https://github.com/GilesLuo/ReassessDTR.
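To make the sensitivity to policy evaluation methods concrete, the following is a minimal sketch of two standard off-policy estimators, ordinary and weighted importance sampling. It is not the evaluation code from the case study or the linked repository. The function name `importance_sampling_estimates`, the policies `always_treat` and `uniform_random`, and the three-episode `toy_episodes` dataset are all hypothetical and invented purely for illustration.

```python
def importance_sampling_estimates(episodes, target_prob, gamma=1.0):
    """Return (ordinary IS, weighted IS) value estimates for a target policy.

    episodes: list of episodes, each a list of
              (state, action, reward, behavior_prob) tuples, where
              behavior_prob is the logged policy's probability of the action.
    target_prob: function (state, action) -> probability under the
                 policy being evaluated.
    """
    weights, returns = [], []
    for episode in episodes:
        rho, g = 1.0, 0.0
        for t, (state, action, reward, behavior_prob) in enumerate(episode):
            rho *= target_prob(state, action) / behavior_prob  # cumulative ratio
            g += (gamma ** t) * reward                         # discounted return
        weights.append(rho)
        returns.append(g)

    weighted_sum = sum(w * g for w, g in zip(weights, returns))
    ordinary = weighted_sum / len(episodes)      # normalise by episode count
    total_weight = sum(weights)
    # Weighted IS normalises by the sum of weights; undefined if all are zero.
    weighted = weighted_sum / total_weight if total_weight > 0 else float("nan")
    return ordinary, weighted


# Toy data (an assumption, not from the paper): two binary decisions per
# episode, logged under a uniform-random behavior policy (probability 0.5).
toy_episodes = [
    [(0, 1, 0.0, 0.5), (1, 1, 1.0, 0.5)],   # return +1
    [(0, 0, 0.0, 0.5), (1, 1, -1.0, 0.5)],  # return -1
    [(0, 1, 0.0, 0.5), (1, 0, -1.0, 0.5)],  # return -1
]


def always_treat(state, action):
    """Hypothetical deterministic policy that always picks action 1."""
    return 1.0 if action == 1 else 0.0


def uniform_random(state, action):
    """Random baseline: identical to the behavior policy in this toy data."""
    return 0.5


ois_treat, wis_treat = importance_sampling_estimates(toy_episodes, always_treat)
ois_rand, wis_rand = importance_sampling_estimates(toy_episodes, uniform_random)
print("always_treat  :", ois_treat, wis_treat)
print("uniform_random:", ois_rand, wis_rand)
```

On this toy data the two estimators disagree for `always_treat`: the ordinary estimate is 4/3, which exceeds the largest return present in the data (1), while the weighted estimate is 1. For the random baseline both estimates equal -1/3. Even in this small example, the value assigned to a policy depends on the estimator chosen, which is the kind of sensitivity the abstract describes.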
