OMG-RL: Offline Model-based Guided Reward Learning for Heparin Treatment

Yooseok Lim; Sujee Lee

TL;DR

This paper tackles the challenge of reward design in reinforcement learning for medication dosing by introducing OMG-RL, an offline model-based guided reward learning framework that learns a reward function $r_\psi$ from clinician data. By integrating a probabilistic dynamic model, conservative policy evaluation, and MaxEnt IRL-based reward guidance, OMG-RL learns dosing policies from finite historical data (MIMIC-III) and rolls out plausible trajectories to improve performance without online data. Empirical results show that the learned reward correlates with clinically meaningful indicators (e.g., aPTT) and that OMG-RL achieves competitive or superior performance relative to model-based and model-free baselines, including higher policy reliability and agreement with clinicians in many regions. The approach holds promise for data-efficient, clinician-aligned RL in medication dosing and could generalize to other drugs and care pathways, though limitations such as discrete action spaces and the need for broader validation are noted.

Abstract

Accurate medication dosing plays an important role in the overall therapeutic process. Consequently, much research has sought to develop optimal administration strategies based on reinforcement learning (RL). However, relying solely on a few explicitly defined reward functions makes it difficult to learn a treatment strategy that accommodates the diverse characteristics of different patients. Moreover, the multitude of drugs used in clinical practice makes it infeasible to construct a dedicated reward function for each medication. Here, we develop a reward network that captures clinicians' therapeutic intentions, departing from explicit rewards, and derive an optimal heparin dosing policy. In this study, we introduce Offline Model-based Guided Reward Learning (OMG-RL), which performs offline inverse RL (IRL). Through OMG-RL, we learn a parameterized reward function that captures the expert's intentions from limited data, thereby enhancing the agent's policy. We validate the proposed approach on the heparin dosing task. We show that the OMG-RL policy is positively reinforced not only in terms of the learned reward network but also in terms of activated partial thromboplastin time (aPTT), a key indicator for monitoring the effects of heparin. This indicates that the OMG-RL policy adequately reflects clinicians' intentions. This approach can be applied not only to the heparin dosing problem but also to RL-based medication dosing tasks in general.
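
As a rough illustration of what learning a parameterized reward from expert data can look like, the sketch below fits a reward with a sample-based maximum entropy IRL objective: it raises the reward on expert transitions while lowering a log-partition term estimated from sampled transitions. This is not the paper's implementation. The linear reward, the feature vectors, the uniform weighting of samples, and all function names (`reward`, `maxent_irl_loss`, `maxent_irl_grad`, `fit_reward`) are assumptions made for the example, and the model-based rollouts and conservative policy evaluation of OMG-RL are omitted.

```python
import math

def reward(psi, features):
    """Hypothetical linear reward r_psi = psi . phi(s, a), standing in for a reward network."""
    return sum(w * x for w, x in zip(psi, features))

def maxent_irl_loss(psi, expert_feats, sampled_feats):
    """Negative mean expert reward plus a sample-based log-partition estimate.

    Assumption: samples are weighted uniformly (no importance weights).
    """
    expert_term = sum(reward(psi, f) for f in expert_feats) / len(expert_feats)
    rs = [reward(psi, f) for f in sampled_feats]
    m = max(rs)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(r - m) for r in rs) / len(rs))
    return -expert_term + log_z

def maxent_irl_grad(psi, expert_feats, sampled_feats):
    """Gradient of the loss: softmax-weighted sampled features minus mean expert features."""
    rs = [reward(psi, f) for f in sampled_feats]
    m = max(rs)
    ws = [math.exp(r - m) for r in rs]
    total = sum(ws)
    grad = []
    for k in range(len(psi)):
        expert_mean = sum(f[k] for f in expert_feats) / len(expert_feats)
        sampled_mean = sum(w * f[k] for w, f in zip(ws, sampled_feats)) / total
        grad.append(sampled_mean - expert_mean)
    return grad

def fit_reward(expert_feats, sampled_feats, steps=50, lr=0.1):
    """Plain gradient descent on the loss, starting from psi = 0."""
    psi = [0.0] * len(expert_feats[0])
    for _ in range(steps):
        g = maxent_irl_grad(psi, expert_feats, sampled_feats)
        psi = [w - lr * gk for w, gk in zip(psi, g)]
    return psi

# Toy feature vectors (illustrative only, not derived from MIMIC-III).
expert = [[1.0, 0.0], [0.9, 0.1]]
sampled = [[0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
psi = fit_reward(expert, sampled)
```

On this toy data the fitted weights assign higher reward to expert-like features than to the sampled ones. A practical system would replace the linear reward with a neural network and draw the sampled transitions from the current policy.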

Paper Structure (20 sections, 8 equations, 8 figures, 2 algorithms)

Introduction
Related work
Heparin Treatment with RL
Offline Model-Based RL
Online and Offline IRL
Background
Markov Decision Process (MDP)
Maximum Entropy IRL
Methods
Dynamic Model
Conservative Policy Evaluation and Improvement
Guided Reward
Experimental Setup
Dataset
Practical Implementation
...and 5 more sections

Figures (8)

Figure 1: Diagram of the OMG-RL framework.
Figure 2: Changes in Return ($r_p$) and Q-Value of COMBO.
Figure 3: Comparison of returns between model-based and model-free approaches: (a) returns ($r_{p}$) and (b) returns (WIS).
Figure 4: Changes in $r_\psi$ and $r_p$ of OMG-RL.
Figure 5: Comparison of normalized returns across three reward types ($r_\psi$, $r_p$, WIS) for OMG-RL, model-based, and model-free approaches.
...and 3 more figures
