Optimizing Warfarin Dosing Using Contextual Bandit: An Offline Policy Learning and Evaluation Method
Yong Huang, Charles A. Downs, Amir M. Rahmani
TL;DR
The paper addresses the challenge of assigning personalized warfarin dosages by casting the problem as an offline contextual bandit and learning policies from historical data without online exploration. It uses two offline policy-learning methods (Offset Tree and Doubly Robust) to derive dosing policies and evaluates them with three off-policy evaluation (OPE) estimators: Rejection Sampling, Doubly Robust, and NCIS. The results show that the learned policies can surpass the baseline demonstrations, even when those demonstrations are suboptimal, and do so without genotype information, which highlights their practical potential for real-world deployment. The study contributes the first offline ML approach to warfarin dosing, provides an empirical evaluation of OPE tools in this domain, and emphasizes the safety and scalability benefits for healthcare decision-making.
Abstract
Warfarin, an anticoagulant medication, is used to prevent and treat conditions associated with abnormal blood clotting, making it one of the most prescribed drugs globally. However, determining a suitable dosage remains challenging because individual responses vary, and prescribing an incorrect dosage may lead to severe consequences. Contextual bandits and reinforcement learning have shown promise in addressing this issue. Given the wide availability of observational data and the safety concerns of decision-making in healthcare, we focused on using exclusively observational data from historical policies as demonstrations to derive new policies; we utilized offline policy learning and evaluation in a contextual bandit setting to establish the optimal personalized dosage strategy. Our learned policies surpassed the baseline approaches without genotype inputs, even when given a suboptimal demonstration, showcasing promising application potential.
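To make the idea of offline evaluation in a contextual bandit concrete, the following is a minimal sketch of a generic doubly robust off-policy value estimate. It is not the authors' implementation: the log format, the function name `doubly_robust_value`, the two-action toy data, the reward scale, and the constant reward model are all assumptions chosen for illustration.

```python
from typing import Callable, Sequence, Tuple

# One logged interaction: (context, action, reward, propensity).
# `propensity` is the probability with which the historical (logging)
# policy chose the logged action. This record layout is an assumption.
Logged = Tuple[Sequence[float], int, float, float]


def doubly_robust_value(
    logs: Sequence[Logged],
    target_policy: Callable[[Sequence[float]], int],
    reward_model: Callable[[Sequence[float], int], float],
) -> float:
    """Estimate the average reward of a deterministic target policy from logs."""
    total = 0.0
    for context, action, reward, propensity in logs:
        chosen = target_policy(context)
        # Model-based term: predicted reward for the action the target policy picks.
        estimate = reward_model(context, chosen)
        # Importance-weighted correction, applied only when the target
        # policy agrees with the action that was actually logged.
        if chosen == action:
            estimate += (reward - reward_model(context, action)) / propensity
        total += estimate
    return total / len(logs)


# Hypothetical toy data: one feature, two dose levels (0 and 1),
# reward 1.0 for a correct dose and 0.0 otherwise, uniform logging policy.
logs = [
    ([1.0], 0, 1.0, 0.5),
    ([1.0], 1, 0.0, 0.5),
    ([2.0], 1, 1.0, 0.5),
    ([2.0], 0, 0.0, 0.5),
]


def target_policy(context: Sequence[float]) -> int:
    # Hypothetical learned policy: low dose below the threshold, high dose above.
    return 0 if context[0] < 1.5 else 1


def reward_model(context: Sequence[float], action: int) -> float:
    # Deliberately crude reward model: always predicts 0.5.
    return 0.5


value = doubly_robust_value(logs, target_policy, reward_model)
print(value)
```

In this toy setup the target policy always picks the rewarded dose, and the estimate recovers that value even though the reward model is uninformative, because the importance-weighted correction compensates for the model's error on the matching logged samples.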
