POLAR: A Pessimistic Model-based Policy Learning Algorithm for Dynamic Treatment Regimes

Ruijia Zhang, Xiangyu Zhang, Zhengling Qi, Yue Wu, Yanxun Xu

TL;DR

POLAR tackles offline optimization of dynamic treatment regimes by learning a pessimistic, model-based representation of transition dynamics and rewards from historical trajectories. It constructs a modified DTR model with uncertainty-based penalties and optimizes a history-aware policy within a soft-max, actor-critic framework using a linear sieve. Theoretical guarantees decompose suboptimality into model-shift and policy-optimization components, with explicit rates for linear and GP transition models under partial data coverage, plus a near-minimax lower bound in the linear setting. Empirical results on synthetic data and MIMIC-III show POLAR achieving near-optimal policies and robust performance under data limitations and model misspecification, highlighting its practical value for sequential healthcare decision-making. The work advances offline DTRs by delivering both statistical and computational guarantees and by accommodating flexible function classes and partial coverage regimes, with potential extensions to asynchronous decision points and longer horizons.

Abstract

Dynamic treatment regimes (DTRs) provide a principled framework for optimizing sequential decision-making in domains where decisions must adapt over time in response to individual trajectories, such as healthcare, education, and digital interventions. However, existing statistical methods often rely on strong positivity assumptions and lack robustness under partial data coverage, while offline reinforcement learning approaches typically focus on average training performance, lack statistical guarantees, and require solving complex optimization problems. To address these challenges, we propose POLAR, a novel pessimistic model-based policy learning algorithm for offline DTR optimization. POLAR estimates the transition dynamics from offline data and quantifies uncertainty for each history-action pair. A pessimistic penalty is then incorporated into the reward function to discourage actions with high uncertainty. Unlike many existing methods that focus on average training performance or provide guarantees only for an oracle policy, POLAR directly targets the suboptimality of the final learned policy and offers theoretical guarantees, without relying on computationally intensive minimax or constrained optimization procedures. To the best of our knowledge, POLAR is the first model-based DTR method to provide both statistical and computational guarantees, including finite-sample bounds on policy suboptimality. Empirical results on both synthetic data and the MIMIC-III dataset demonstrate that POLAR outperforms state-of-the-art methods and yields near-optimal, history-aware treatment strategies.
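
The penalized-reward idea described in the abstract can be illustrated with a minimal sketch. It assumes a linear transition model fitted by ridge regression and uses the elliptical quantity sqrt(phi^T Lambda^{-1} phi) as the uncertainty quantifier. This section does not specify the paper's quantifier, feature map, or penalty scaling, so the names `fit_ridge`, `uncertainty`, `pessimistic_reward`, and the multiplier `c` are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_ridge(Phi, Y, lam=1.0):
    """Ridge estimate of W in Y ~ Phi @ W, plus the regularized Gram matrix."""
    d = Phi.shape[1]
    Lambda = Phi.T @ Phi + lam * np.eye(d)
    W_hat = np.linalg.solve(Lambda, Phi.T @ Y)
    return W_hat, Lambda

def uncertainty(phi, Lambda):
    """Assumed quantifier: sqrt(phi^T Lambda^{-1} phi) for one feature vector."""
    return float(np.sqrt(phi @ np.linalg.solve(Lambda, phi)))

def pessimistic_reward(r_hat, phi, Lambda, c=1.0):
    """Estimated reward minus c times the uncertainty (hypothetical scaling)."""
    return r_hat - c * uncertainty(phi, Lambda)

rng = np.random.default_rng(0)

# Synthetic offline data that covers only the first two feature directions,
# mimicking partial coverage: the third direction is never observed.
Phi = np.zeros((200, 3))
Phi[:, :2] = rng.normal(size=(200, 2))
W_true = rng.normal(size=(3, 2))
Y = Phi @ W_true + 0.1 * rng.normal(size=(200, 2))

W_hat, Lambda = fit_ridge(Phi, Y, lam=1.0)

phi_seen = np.array([1.0, 0.0, 0.0])    # direction supported by the data
phi_unseen = np.array([0.0, 0.0, 1.0])  # direction absent from the data

u_seen = uncertainty(phi_seen, Lambda)
u_unseen = uncertainty(phi_unseen, Lambda)

# The same estimated reward is penalized more heavily where coverage is poor.
r_seen = pessimistic_reward(1.0, phi_seen, Lambda, c=2.0)
r_unseen = pessimistic_reward(1.0, phi_unseen, Lambda, c=2.0)
```

In this toy setup the unobserved direction receives the largest possible uncertainty under the ridge prior, so a policy maximizing the penalized reward is steered toward history-action pairs that the offline data actually supports.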

Paper Structure

This paper contains 41 sections, 30 theorems, 248 equations, 5 figures, 3 algorithms.

Key Result

Theorem 1

Suppose the assumptions on the uncertainty quantifier and on the sup norm of e hold. Let the hyperparameters in the general algorithm be set as $\widetilde{c}_k = \sum_{j=k}^K \|\overline{r}_j\|_\infty$, and choose the stepsizes as $\eta_k^{(t)} = \frac{c_\eta}{\sqrt{T}}$. Then, with probabi[...] Here $V^{\pi^\dagger}_{P^*, \underline{r}}$ denotes the value function of $\pi^\dagger$ under the [...]
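
The two hyperparameter choices in Theorem 1 are simple to compute. The sketch below evaluates $\widetilde{c}_k$ as a tail sum of per-stage reward sup norms and the constant stepsize $c_\eta/\sqrt{T}$. The function names and the numeric inputs are hypothetical; only the two formulas come from the theorem statement.

```python
import numpy as np

def tail_sup_norm_sums(reward_sup_norms):
    """Return c_k = sum_{j=k}^K ||r_j||_inf for k = 1..K (0-indexed array)."""
    r = np.asarray(reward_sup_norms, dtype=float)
    # Reverse, take the running sum, and reverse back to get tail sums.
    return np.cumsum(r[::-1])[::-1]

def constant_stepsize(c_eta, T):
    """Stepsize eta = c_eta / sqrt(T), identical for every stage and iteration."""
    return c_eta / np.sqrt(T)

# Illustrative values: K = 3 stages with sup norms 1, 2, 3 and T = 100 iterations.
c_tilde = tail_sup_norm_sums([1.0, 2.0, 3.0])  # array([6., 5., 3.])
eta = constant_stepsize(c_eta=0.5, T=100)      # 0.05
```

Because $\widetilde{c}_k$ sums the sup norms from stage $k$ to stage $K$, it is nonincreasing in $k$, and the stepsize shrinks as the iteration budget $T$ grows.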

Figures (5)

  • Figure 1: Policy value vs. iterations for different values of $c$ ($p$=0.75, $n$=200).
  • Figure 2: Policy value averaged over 100 repeated simulations, for different $n,p,c$.
  • Figure 3: Policy value versus number of iterations for POLAR with different values of $c$, compared to DDQN, DTR-$Q$, MOPO, and MILO. All policy values are evaluated using OPE.
  • Figure 4: The action distributions generated by the physician policy, POLAR ($c=20$), DTR-$Q$, DDQN, MILO and MOPO across three decision stages.
  • Figure S1: Policy values under transition model misspecification. Data is generated from a linear model, but a Gaussian Process (GP) model is used for estimation. Policy value averaged over 100 repeated simulations, for different $n,p,c$.

Theorems & Definitions (62)

  • Theorem 1
  • Proposition 1
  • Theorem 2: Suboptimality of the Linear Transition Model
  • Theorem 3: Minimax Lower Bound of the Linear Transition Model
  • Corollary 1: Sample Complexity for Linear Transition Model
  • Remark 1
  • Theorem 4: Suboptimality of GP Model
  • Corollary 2: Sample Complexity of GP Model
  • Proposition S1
  • Lemma S1
  • ...and 52 more