Optimizing Warfarin Dosing Using Contextual Bandit: An Offline Policy Learning and Evaluation Method

Yong Huang, Charles A. Downs, Amir M. Rahmani

TL;DR

The paper addresses the challenge of assigning personalized warfarin dosages by casting the problem as an offline contextual bandit and learning policies from historical data without online exploration. It uses two offline policy-learning methods (Offset Tree and Doubly Robust) to derive dosing policies and evaluates them with three off-policy evaluation (OPE) estimators: Rejection Sampling, Doubly Robust, and NCIS. The results show that the learned policies can surpass the baseline demonstrations, even when the demonstrations are suboptimal, and do so without genotype information, which suggests practical potential for real-world deployment. The study contributes the first offline ML approach to warfarin dosing, provides an empirical evaluation of OPE tools in this domain, and emphasizes safety and scalability benefits for healthcare decision-making.
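
To make the doubly robust idea mentioned above more concrete, here is a minimal sketch of a doubly robust off-policy value estimate. It is an illustration under stated assumptions, not the paper's implementation: the function name, argument names, the deterministic target policy, and the three-action toy data (standing in for dose buckets) are all hypothetical. The estimate combines a reward model's prediction for the new policy's action with an importance-weighted correction on the samples where the logged action matches that action.

```python
import numpy as np

def doubly_robust_value(actions, rewards, logging_probs, target_actions, reward_model):
    """Doubly robust estimate of a deterministic target policy's value.

    All names here are hypothetical and chosen for illustration.
    actions:        (n,) int, action logged by the old policy
    rewards:        (n,) float, reward observed for the logged action
    logging_probs:  (n,) float, probability the old policy gave the logged action
    target_actions: (n,) int, action the new policy would choose
    reward_model:   (n, k) float, predicted reward for each of the k actions
    """
    idx = np.arange(len(actions))
    # Model-based term: predicted reward of the action the new policy picks.
    direct = reward_model[idx, target_actions]
    # Correction term: importance-weighted residual of the reward model,
    # nonzero only where the logged action equals the new policy's action.
    match = (actions == target_actions).astype(float)
    correction = match / logging_probs * (rewards - reward_model[idx, actions])
    return float(np.mean(direct + correction))

# Toy logged data: 4 patients, 3 actions, uniform old policy (assumed).
actions = np.array([0, 1, 2, 1])
rewards = np.array([1.0, 0.0, 1.0, 1.0])
logging_probs = np.full(4, 1.0 / 3.0)
target_actions = np.array([0, 0, 2, 1])

# With an all-zero reward model the estimate reduces to plain
# inverse propensity scoring.
ips_like = doubly_robust_value(
    actions, rewards, logging_probs, target_actions, np.zeros((4, 3))
)
```

If the reward model is accurate, the correction term shrinks toward zero and the variance of the estimate drops; if the logged propensities are accurate, the correction removes the reward model's bias on average. This is the sense in which the estimator is "doubly" robust.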

Abstract

Warfarin, an anticoagulant medication, is used to prevent and treat conditions associated with abnormal blood clotting, and it is one of the most prescribed drugs globally. However, determining a suitable dosage remains challenging because individual responses vary, and prescribing an incorrect dosage may lead to severe consequences. Contextual bandits and reinforcement learning have shown promise in addressing this issue. Given the wide availability of observational data and the safety concerns of decision-making in healthcare, we focused on using exclusively observational data from historical policies as demonstrations to derive new policies; we utilized offline policy learning and evaluation in a contextual bandit setting to establish the optimal personalized dosage strategy. Our learned policies surpassed the baseline approaches without genotype inputs, even when given a suboptimal demonstration, showcasing promising application potential.

Paper Structure (7 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Workflow of offline learning and evaluation. An essential distinction between contextual bandits and supervised learning is that the ground-truth optimal action is not revealed to the learning and evaluation algorithms. This makes the setting close to real-world decision-making problems, where the outcome associated with the optimal action may be counterfactual and unavailable in observational data.
  • Figure 2: Boxplot of the expected reward on the test sets across thirty experiments. In each subfigure, the offset tree and the doubly robust estimator learn from the corresponding old policy.
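
The distinction drawn in the Figure 1 caption can be sketched in code: a fully labelled dataset is turned into logged bandit feedback, so that downstream algorithms see only the action the old policy took, its reward, and its propensity, never the optimal action itself. This is a hedged illustration rather than the paper's pipeline. The function name, the binary reward (1 if the logged action equals the optimal one, else 0), and the toy old policies are assumptions made for the example.

```python
import numpy as np

def to_bandit_feedback(optimal_actions, logging_probs, rng):
    """Simulate logged bandit feedback from fully labelled data.

    All names and the binary reward are hypothetical choices.
    optimal_actions: (n,) int, ground-truth best action (hidden downstream)
    logging_probs:   (n, k) float, old policy's action distribution per row
    rng:             numpy random Generator used to sample logged actions
    Returns (actions, rewards, propensities), each of shape (n,).
    """
    n, k = logging_probs.shape
    # The old policy picks one action per patient.
    actions = np.array([rng.choice(k, p=p) for p in logging_probs])
    # Only the reward of the chosen action is observed.
    rewards = (actions == optimal_actions).astype(float)
    # Probability the old policy assigned to the action it actually took.
    propensities = logging_probs[np.arange(n), actions]
    return actions, rewards, propensities

rng = np.random.default_rng(0)
optimal = np.array([0, 1, 2, 1, 0])

# A uniform (suboptimal) old policy over three actions.
uniform = np.full((5, 3), 1.0 / 3.0)
a_uni, r_uni, p_uni = to_bandit_feedback(optimal, uniform, rng)

# A deterministic old policy that always takes the optimal action.
oracle = np.eye(3)[optimal]
a_opt, r_opt, p_opt = to_bandit_feedback(optimal, oracle, rng)
```

Learners and evaluators would receive only the three returned arrays together with the patient features. The rewards of the actions that were not taken remain counterfactual, which is what separates this setting from supervised learning.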