Pessimistic Causal Reinforcement Learning with Mediators for Confounded Offline Data

Danyang Wang; Chengchun Shi; Shikai Luo; Will Wei Sun

TL;DR

The paper tackles offline policy learning under unmeasured confounding and distributional shift by introducing a mediated MDP (M2DP) and using the front-door criterion to identify interventional effects through observed mediators. It proposes two algorithms, CAL and PESCAL: CAL learns a mediated Q-function, and PESCAL adds a pointwise uncertainty bound on the mediator distribution to implement pessimism, which avoids uncertainty quantification for the full Q-function. Theoretical results establish the existence of an optimal mediated policy and give convergence and regret bounds for both methods under mixing and coverage assumptions. Empirically, PESCAL outperforms standard offline RL baselines on synthetic confounded data and yields improvements on real ride-hailing offline data.
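To make the front-door idea concrete, the following is a minimal single-state sketch, not the paper's M2DP estimator. It assumes a hypothetical toy model with a binary unobserved confounder C, action A, mediator M, and reward R, where A affects R only through M; all probability tables and names (`p_c`, `p_a_given_c`, `p_m_given_a`, `r_mean`) are illustrative assumptions. It compares the naive conditional mean E[R | A=a] with the standard front-door adjustment, computed only from quantities that are identifiable without observing C.

```python
import numpy as np

# Hypothetical toy model: single state, binary variables, illustrative numbers.
p_c = np.array([0.5, 0.5])                 # P(C=c); C is unobserved in practice
p_a_given_c = np.array([[0.8, 0.2],        # P(A=a | C=c), rows indexed by c
                        [0.2, 0.8]])
p_m_given_a = np.array([[0.9, 0.1],        # P(M=m | A=a), rows indexed by a
                        [0.3, 0.7]])
r_mean = np.array([[0.0, 1.0],             # E[R | M=m, C=c], rows indexed by m
                   [1.0, 2.0]])

# Joint P(c, a, m) = P(c) P(a | c) P(m | a); the action acts only through M.
joint = p_c[:, None, None] * p_a_given_c[:, :, None] * p_m_given_a[None, :, :]

# Quantities that depend only on the observed variables (A, M, R).
# Here they are computed exactly; in practice they would be estimated from data.
p_am = joint.sum(axis=0)                   # P(a, m)
p_a = p_am.sum(axis=1)                     # P(a)
p_m_a = p_am / p_a[:, None]                # P(m | a)
r_am = np.einsum("cam,mc->am", joint, r_mean) / p_am   # E[R | a, m]

# Naive estimate: E[R | A=a], biased because C drives both A and R.
naive = (p_m_a * r_am).sum(axis=1)

# Front-door adjustment: sum_m P(m | a) * sum_a' P(a') E[R | a', m].
front_door = p_m_a @ (p_a @ r_am)

# Ground truth E[R | do(A=a)], available here only because the model is known.
truth = p_m_given_a @ (r_mean @ p_c)

print("naive     :", naive)
print("front-door:", front_door)
print("truth     :", truth)
```

In this toy model the front-door estimate matches the interventional mean exactly, while the naive conditional mean overstates the gap between the two actions.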

Abstract

In real-world scenarios, datasets collected from randomized experiments are often constrained in size by limits on time and budget. As a result, leveraging large observational datasets becomes a more attractive option for achieving high-quality policy learning. However, most existing offline reinforcement learning (RL) methods depend on two key assumptions--unconfoundedness and positivity--which frequently do not hold in observational data. To address these challenges, we propose a novel policy learning algorithm, PESsimistic CAusal Learning (PESCAL). We use a mediator variable, based on the front-door criterion, to remove the confounding bias; additionally, we adopt the pessimistic principle to address the distributional shift between the action distributions induced by candidate policies and the behavior policy that generates the observational data. Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function, to partially mitigate the issue of distributional shift. This insight significantly simplifies our algorithm by circumventing the challenging task of sequential uncertainty quantification for the estimated Q-function. Moreover, we provide theoretical guarantees for the proposed algorithms and demonstrate their efficacy through simulations, as well as real-world experiments using offline datasets from a leading ride-hailing platform.
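As a rough illustration of what a lower bound on the mediator distribution can look like, the sketch below works in a single state with discrete actions and mediators. It is an assumption-laden example, not the paper's uncertainty quantifier: the Hoeffding-style bonus, the function name `pessimistic_mediator_probs`, the counts, and the nonnegative mediator values `v` are all hypothetical.

```python
import numpy as np

def pessimistic_mediator_probs(counts, delta=0.05):
    """Pointwise lower bound on P(m | a) from a table of counts n(a, m).

    Assumption: a Hoeffding-style bonus sqrt(log(2/delta) / (2 n(a))) is
    subtracted from the empirical frequency and the result is clipped at 0.
    """
    counts = np.asarray(counts, dtype=float)
    n_a = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)  # avoid divide-by-zero
    p_hat = counts / n_a
    bonus = np.sqrt(np.log(2.0 / delta) / (2.0 * n_a))
    return np.clip(p_hat - bonus, 0.0, None)

# Hypothetical counts: action 0 is well covered, action 1 has only 5 samples.
counts = np.array([[450, 50],
                   [2, 3]])
# Hypothetical nonnegative downstream values of each mediator level.
v = np.array([1.0, 2.0])

p_hat = counts / counts.sum(axis=1, keepdims=True)
p_low = pessimistic_mediator_probs(counts)

greedy_values = p_hat @ v   # plug-in value of each action
pess_values = p_low @ v     # pessimistic value of each action

print("plug-in values    :", greedy_values, "-> picks action", greedy_values.argmax())
print("pessimistic values:", pess_values, "-> picks action", pess_values.argmax())
```

Because the values `v` are nonnegative, shrinking the mediator probabilities can only lower an action's value, and it lowers it most for actions with little data. The plug-in rule prefers the poorly covered action 1, whereas the pessimistic rule falls back to the well-covered action 0.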

Paper Structure (17 sections, 9 theorems, 75 equations, 6 figures, 2 algorithms)


Introduction
Contributions
Related Works
Organization of the Paper
Preliminaries
Confounded MDP with Mediator
The Pessimistic Principle and Uncertainty Quantification
Pessimistic Causal Learning
The Optimal Policy in A Confounded M2DP
Pessimistic Causal Q-learning
Theoretical Guarantees
Simulation
Real Data Application
Discussion
Proof of Theorem 1
...and 2 more sections

Key Result

Theorem 1

Under the front-door criterion assumption, (i) there exists a unique optimal deterministic stationary policy $\pi^*$ whose value $J(\pi^*)$ is no worse than that of any other policy; (ii) $\pi^*$ is greedy with respect to a weighted average of $Q^{*}$. In particular, (iii) $Q^*$ satisfies the following Bellman optimality equation:

Figures (6)

Figure 1: Causal relationships in Markov decision processes. (a) Standard MDP; (b) MDP with an unobserved confounder. $S$, $A$, $R$, and $C$ represent state, action, reward, and confounder, respectively. Arrows denote causal relationships. Solid lines indicate observed variables (or relationships), while dotted lines indicate unobserved variables (or relationships).
Figure 2: Mediated Markov Decision Processes (M2DPs): (a) We assume that the offline dataset generating process follows an offline confounded M2DP, as illustrated in the diagram; (b) Online unconfounded M2DP. Again, dotted lines indicate that the relationship (or variable) is not directly observable. We aim to learn an optimal policy $\pi^*$ that can be executed online without the confounding effect, as shown in (b), while training on a static dataset $\mathcal{D}$ generated by the process described in (a).
Figure 3: Example distributions of the estimated expected reward for policy learning in a three-armed bandit. The oracle expected rewards for the three arms $a_1$, $a_2$ and $a_3$ are given by $\mu_1>\mu_2>\mu_3$. However, due to the limited sample size for the second arm, the estimated expected reward $\widehat{\mu}_2$ has high variance. Consequently, $\widehat{\mu}_2>\widehat{\mu}_1>\widehat{\mu}_3$ occurs with high probability, as illustrated in the graph.
Figure 4: Confounding relationship in synthetic data conditional on state. This figure depicts the impact of an unobserved confounder ($C_t$) on both action ($A_t$) and reward ($R_t$). Larger "$+$" or "$-$" symbols indicate stronger positive or negative correlations, respectively. This plot highlights how confounding can mislead policy learning, and how the mediator variable ($M_t$) elucidates the genuine relationship between action and reward.
Figure 5: Online return of the learned policy across training iterations in the synthetic data experiments. The top row corresponds to the confounded M2DP shown in Figure \ref{img: medi}; the second row depicts a standard MDP free from confounding effects (Figure \ref{img: online_mdp}). The only difference between the two is the absence of a confounder in the second row; otherwise, both share the same environment. In cases (a) and (b), a certain number of original tuples are retained, while all tuples featuring suboptimal actions are eliminated from the remaining data. In contrast, case (c) retains all original data, a total of 50,000 tuples.
...and 1 more figures
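
The failure mode described in the Figure 3 caption, and the effect of a pessimistic correction, can be reproduced with a small simulation. This is a hedged sketch under assumed values: the arm means, sample sizes, Gaussian reward noise, and the lower-confidence-bound multiplier of 2 are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, 0.9, 0.5])   # assumed oracle means, mu_1 > mu_2 > mu_3
n = np.array([1000, 3, 1000])    # arm 2 has very few samples
sigma = 1.0                      # assumed reward standard deviation
reps = 2000                      # number of simulated offline datasets

# With Gaussian rewards, each sample mean is normal with standard error sigma / sqrt(n).
se = sigma / np.sqrt(n)
mu_hat = rng.normal(loc=mu, scale=se, size=(reps, 3))

# Greedy rule: pick the arm with the largest estimated mean.
greedy_pick = mu_hat.argmax(axis=1)

# Pessimistic rule: pick the arm with the largest lower confidence bound.
lcb_pick = (mu_hat - 2.0 * se).argmax(axis=1)

greedy_error = np.mean(greedy_pick != 0)   # fraction of datasets missing the best arm
lcb_error = np.mean(lcb_pick != 0)

print(f"greedy picks a suboptimal arm in {greedy_error:.1%} of datasets")
print(f"LCB    picks a suboptimal arm in {lcb_error:.1%} of datasets")
```

The penalty is largest for the under-sampled arm, so the pessimistic rule rarely selects it on the strength of a noisy estimate, while the greedy rule does so in a large share of the simulated datasets.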

Theorems & Definitions (10)

Theorem 1: Existence and uniqueness of optimal policy
Theorem 2: Convergence of CAL
Theorem 3: Convergence of PESCAL
Definition 1: Contraction mapping (Nadler, 1969)
Lemma S1: Banach fixed point theorem (Banach, 1922)
Lemma S2: Bellman Contraction Mapping
Theorem 4
Lemma S3
Lemma S4
Lemma S5
