Model-Based Reinforcement Learning Under Confounding

Nishanth Venkatesh, Andreas A. Malikopoulos

TL;DR

The paper addresses model-based reinforcement learning under unobserved confounding in contextual MDPs by reframing the problem as a POMDP and applying proximal off-policy evaluation to deconfound the reward term using observable proxies. It combines a behavior-averaged surrogate transition model with MaxCausalEnt model learning to produce a Bellman-consistent surrogate MDP for state-based policies. A sequence of proxy-based identifications yields an observable, identifiable reward term that enables principled offline learning and planning in confounded environments. Empirical results in a synthetic clinical setting show improved long-horizon accuracy and modest performance gains over naive baselines, highlighting practical impact for domains with unrecorded contextual information.

Abstract

We investigate model-based reinforcement learning in contextual Markov decision processes (C-MDPs) in which the context is unobserved and induces confounding in the offline dataset. In such settings, conventional model-learning methods are fundamentally inconsistent, as the transition and reward mechanisms generated under a behavioral policy do not correspond to the interventional quantities required for evaluating a state-based policy. To address this issue, we adapt a proximal off-policy evaluation approach that identifies the confounded reward expectation using only observable state-action-reward trajectories under mild invertibility conditions on proxy variables. When combined with a behavior-averaged transition model, this construction yields a surrogate MDP whose Bellman operator is well defined and consistent for state-based policies, and which integrates seamlessly with the maximum causal entropy (MaxCausalEnt) model-learning framework. The proposed formulation enables principled model learning and planning in confounded environments where contextual information is unobserved, unavailable, or impractical to collect.
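To make the inconsistency claim concrete, the following minimal sketch computes, in closed form, the gap between the reward a naive model would fit from confounded data and the interventional reward needed to evaluate a state-based policy. It is not the paper's method or experiment: the toy problem (a single state, a binary hidden context, a binary action) and every name and number in it (`P_C`, `PI_B`, `REWARD`, and the two helper functions) are hypothetical choices made only for illustration.

```python
# Hypothetical toy problem: one state, hidden binary context c, binary action a.
# All numbers are illustrative assumptions, not values from the paper.

P_C = {0: 0.5, 1: 0.5}                      # prior over the unobserved context
PI_B = {0: {0: 0.9, 1: 0.1},                # behavior policy PI_B[c][a]; it sees c
        1: {0: 0.1, 1: 0.9}}
REWARD = {0: {0: 1.0, 1: 0.0},              # expected reward REWARD[c][a]
          1: {0: 0.0, 1: 1.0}}


def observational_reward(a):
    """E[r | a] under the behavior policy: what a naive model fit would recover."""
    # Marginal probability that the behavior policy picks action a.
    p_a = sum(P_C[c] * PI_B[c][a] for c in P_C)
    # Average the reward over the posterior P(c | a), which the policy has skewed.
    return sum(P_C[c] * PI_B[c][a] / p_a * REWARD[c][a] for c in P_C)


def interventional_reward(a):
    """E[r | do(a)]: the quantity a context-blind, state-based policy actually earns."""
    # Under intervention the action carries no information about c,
    # so the reward is averaged over the prior P(c).
    return sum(P_C[c] * REWARD[c][a] for c in P_C)


if __name__ == "__main__":
    for a in (0, 1):
        print(a, observational_reward(a), interventional_reward(a))
```

In this toy case the behavior policy tends to pick the action that matches the hidden context, so the logged data make each action look better (about 0.9) than it is for a policy that cannot see the context (0.5). The identification approach described in the abstract targets this kind of gap using proxy variables instead of the context itself.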

Paper Structure

This paper contains 14 sections, 5 theorems, 51 equations, 2 figures.

Key Result

Lemma 1

At each $t$, for any trajectory $\tau_t = (u_{0:t}, y_{0:t}, s_{0:t})$, the policy-induced trajectory distribution satisfies

Figures (2)

  • Figure 1: Causal graph of the C-MDP.
  • Figure 2: Comparison of multi-step rollout error

Theorems & Definitions (12)

  • Remark 1
  • Remark 2
  • Lemma 1: Expansion of the Policy-Dependent Trajectory Distribution
  • proof
  • Lemma 2: Reward Distribution Decomposition
  • proof
  • Lemma 3: First Proxy: Incorporating $Y_{t-1}$
  • proof
  • Lemma 4: Second Proxy: Incorporating the current observation $Y_t$
  • proof
  • ...and 2 more