An Orthogonal Learner for Individualized Outcomes in Markov Decision Processes

Emil Javurek, Valentyn Melnychuk, Jonas Schweisthal, Konstantin Hess, Dennis Frauen, Stefan Feuerriegel

TL;DR

This work addresses estimating individualized long-horizon outcomes in MDPs from observational data by reframing off-policy $Q$-function estimation through causal inference. It introduces the DR$Q$-learner, a two-stage meta-learner that achieves double robustness, Neyman-orthogonality, and quasi-oracle efficiency, and is applicable to both discrete and continuous state spaces with flexible function classes. The authors derive a Neyman-orthogonal loss $\mathcal{L}_{\pi_e}(\eta, g)$ based on the efficient influence function and prove that its minimizer is the target $Q_{\pi_e}$, with stability under nuisance misspecification. Empirically, DR$Q$-learners outperform plug-in baselines (Q-regression and FQE), particularly in settings with long horizons and low overlap, indicating practical value for personalized medicine and sequential decision-making under observational data constraints.

Abstract

Predicting individualized potential outcomes in sequential decision-making is central for optimizing therapeutic decisions in personalized medicine (e.g., which dosing sequence to give to a cancer patient). However, predicting potential outcomes over long horizons is notoriously difficult. Existing methods that break the curse of the horizon typically lack strong theoretical guarantees such as orthogonality and quasi-oracle efficiency. In this paper, we revisit the problem of predicting individualized potential outcomes in sequential decision-making (i.e., estimating Q-functions in Markov decision processes with observational data) through a causal inference lens. In particular, we develop a comprehensive theoretical foundation for meta-learners in this setting with a focus on beneficial theoretical properties. As a result, we obtain a novel meta-learner called DRQ-learner and establish that it is (1) doubly robust (i.e., it allows valid inference under misspecification of one of the nuisance functions), (2) Neyman-orthogonal (i.e., insensitive to first-order estimation errors in the nuisance functions), and (3) quasi-oracle efficient (i.e., it behaves asymptotically as if the ground-truth nuisance functions were known). Our DRQ-learner is applicable to settings with both discrete and continuous state spaces. Further, our DRQ-learner is flexible and can be used together with arbitrary machine learning models (e.g., neural networks). We validate our theoretical results through numerical experiments, thereby showing that our meta-learner outperforms state-of-the-art baselines.
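
The curse of the horizon mentioned in the abstract can be made concrete with a small calculation. The sketch below is not taken from the paper: it assumes two actions, state-independent policies, and hypothetical probabilities, and computes the exact variance of the trajectory-level importance weight $\prod_t \pi_e(A_t)/\pi_b(A_t)$. Under these assumptions the variance grows geometrically in the horizon, which is why estimators built on full-trajectory weights degrade for long horizons.

```python
import numpy as np


def trajectory_weight_variance(pi_e, pi_b, horizon):
    """Exact variance of prod_t pi_e(A_t) / pi_b(A_t) with A_t drawn i.i.d. from pi_b.

    Simplifying assumption (not from the paper): both policies ignore the state,
    so the per-step ratios are independent and identically distributed.
    """
    pi_e = np.asarray(pi_e, dtype=float)
    pi_b = np.asarray(pi_b, dtype=float)
    rho = pi_e / pi_b                        # per-step importance ratio for each action
    second_moment = np.sum(pi_b * rho ** 2)  # E_b[rho^2]; note that E_b[rho] = 1
    # Independence gives E[(prod rho)^2] = second_moment ** horizon and E[prod rho] = 1.
    return second_moment ** horizon - 1.0


# Hypothetical policies: the evaluation policy prefers action 0, the behavioral one is uniform.
for T in (1, 5, 10, 20):
    print(T, trajectory_weight_variance([0.8, 0.2], [0.5, 0.5], T))
```

When the two policies coincide, every ratio equals one and the variance is zero for any horizon; the further they differ (lower overlap), the faster the variance grows.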

Paper Structure

This paper contains 20 sections, 6 theorems, 55 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Under Assumptions (1)–(3) from above, the causal estimand in Eq. (eq:causal_quantity) is identifiable from the observed data of trajectories via

Figures (6)

  • Figure 1: Our contributions are located at the intersection of causal inference & orthogonal statistical learning and MDPs. Our problem setup is in MDPs: we estimate Q-functions in MDPs from off-policy data. Baselines for this task break the curse of the horizon but typically lack strong theoretical guarantees. Our method adopts concepts from causal inference & orthogonal statistical learning: we obtain a novel meta-learner called DR$Q$-learner that is doubly robust, Neyman-orthogonal, and quasi-oracle efficient.
  • Figure 2: Our task: we aim to estimate $Q_{\pi_e}$, a functional of the unobserved evaluation policy $\pi_e$ (right), from the observational dataset $\mathcal{D}_{\pi_b}$ from the behavioral policy $\pi_b$ (left). A trajectory from a time-invariant Markov decision process (MDP) is determined by environment dynamics (gray) and by selecting actions according to a policy. We observe the MDP with $\pi_b$ (top left), while a potential MDP with $\pi_e$ (top right) is unobserved. Our target estimand $Q_{\pi_e}$ must thus be estimated from available observational data $\mathcal{D}_{\pi_b}$.
  • Figure 3: Comparison of learners. After observing the data $\mathcal{D}_{\pi_b}$, the learner-specific nuisance functions are estimated first, followed by the actual estimand. Our DR$Q$-learner is highlighted in the figure. Learners suffering from plug-in bias are marked with ✗.
  • Figure 4: Setting A: Unrestricted model class $\mathcal{G}$. The results confirm the theoretical properties: our DR$Q$-learner in blue is better than the plug-in learners in red/orange, robust for varying lengths of the horizon, and is especially effective for settings with low overlap.
  • Figure 5: Setting B: linear model class $\mathcal{G}$. The results confirm that our theory and thus our DR$Q$-learner (in blue) are applicable to different (restricted) function classes.
  • ...and 1 more figure
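
The two-stage recipe in the Figure 3 caption (nuisance functions first, then the estimand) and the double-robustness claim can be illustrated in a much simpler one-step setting. The sketch below is an assumption-laden analogue, not the paper's DR$Q$ loss: it uses the classical doubly robust (AIPW) pseudo-outcome for a single decision with binary covariate and binary action, and all numbers are hypothetical. It computes the exact conditional mean of the pseudo-outcome, so no sampling noise is involved.

```python
import numpy as np


def expected_dr_pseudo_outcome(m_true, pi_true, mu_hat, pi_hat, a):
    """Exact E[phi_a | X = x] for the one-step doubly robust pseudo-outcome

        phi_a = mu_hat(x, a) + 1{A = a} / pi_hat(a | x) * (Y - mu_hat(x, a)).

    All arrays are indexed as [x, action]. m_true is the true outcome regression,
    pi_true the true behavioral propensity; mu_hat and pi_hat are nuisance estimates.
    """
    # E[1{A = a} * (Y - mu_hat(x, a)) | X = x] = pi_true(a|x) * (m_true(x, a) - mu_hat(x, a))
    ratio = pi_true[:, a] / pi_hat[:, a]
    return mu_hat[:, a] + ratio * (m_true[:, a] - mu_hat[:, a])


# Hypothetical ground truth for two covariate values and two actions.
m_true = np.array([[1.0, 3.0], [2.0, 0.5]])
pi_true = np.array([[0.9, 0.1], [0.3, 0.7]])

# Deliberately misspecified nuisance estimates.
mu_wrong = np.zeros_like(m_true)
pi_wrong = np.full_like(pi_true, 0.5)

a = 1
only_propensity_correct = expected_dr_pseudo_outcome(m_true, pi_true, mu_wrong, pi_true, a)
only_outcome_correct = expected_dr_pseudo_outcome(m_true, pi_true, m_true, pi_wrong, a)
both_wrong = expected_dr_pseudo_outcome(m_true, pi_true, mu_wrong, pi_wrong, a)

print("target:                 ", m_true[:, a])
print("only propensity correct:", only_propensity_correct)
print("only outcome correct:   ", only_outcome_correct)
print("both wrong:             ", both_wrong)
```

In this toy case the pseudo-outcome recovers the target whenever either nuisance is correct and is biased only when both are wrong. A second-stage regression of the pseudo-outcome on the covariates would inherit that property; the sequential setting treated in the paper requires different nuisance functions.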

Theorems & Definitions (15)

  • Theorem 1: Identification over trajectories
  • proof
  • Theorem 2: Identification over one-step transitions
  • proof
  • Theorem 3: Neyman-orthogonality
  • proof
  • Theorem 4: Quasi-oracle efficiency
  • Corollary 1: Double robustness
  • proof
  • Lemma 1: Expected TD error is zero
  • ...and 5 more
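
The titles of Theorem 2 (identification over one-step transitions) and Lemma 1 (expected TD error is zero) point to the Bellman evaluation equation. The exact statements are not reproduced here, so the following is only a sketch of the standard fact in a small tabular MDP. The discount factor `gamma`, the random MDP, and the helper names are assumptions made for illustration. The code solves for $Q_{\pi_e}$ exactly and checks that the temporal-difference (TD) error has conditional mean zero at every state-action pair.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9  # hypothetical sizes and discount factor

# Random tabular MDP: P[s, a, s'] are transition probabilities, R[s, a] expected rewards.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

# Random evaluation policy pi_e[s, a].
pi_e = rng.random((n_states, n_actions))
pi_e /= pi_e.sum(axis=1, keepdims=True)


def q_function(P, R, pi, gamma):
    """Solve the Bellman evaluation equations Q = R + gamma * P_pi Q exactly."""
    S, A = R.shape
    # P_pi[(s, a), (s', a')] = P[s, a, s'] * pi[s', a']
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, R.reshape(S * A))
    return q.reshape(S, A)


def expected_td_error(Q, P, R, pi, gamma):
    """E[R + gamma * sum_a' pi(a'|S') Q(S', a') - Q(s, a) | S = s, A = a] for all (s, a)."""
    v_next = (pi * Q).sum(axis=1)      # V(s') = sum_a' pi(a'|s') Q(s', a')
    return R + gamma * P @ v_next - Q  # P @ v_next averages V over next states


Q = q_function(P, R, pi_e, gamma)
td_true = expected_td_error(Q, P, R, pi_e, gamma)         # zero for the true Q-function
td_wrong = expected_td_error(Q + 1.0, P, R, pi_e, gamma)  # nonzero for a shifted candidate

print(np.abs(td_true).max())
print(td_wrong)
```

The conditional expectation is taken given the current state and action, so it does not depend on how those actions were chosen. This is why one-step transitions collected under a behavioral policy can, given sufficient overlap, carry information about the Q-function of a different evaluation policy.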