Pessimistic Causal Reinforcement Learning with Mediators for Confounded Offline Data
Danyang Wang, Chengchun Shi, Shikai Luo, Will Wei Sun
TL;DR
The paper addresses offline policy learning under unmeasured confounding and distributional shift. It introduces a mediated MDP (M2DP) and uses the front-door criterion to identify interventional effects through observed mediators. Two algorithms are proposed: CAL learns a mediated Q-function, and PESCAL adds a pointwise uncertainty bound on the mediator distribution to implement pessimism, which avoids uncertainty quantification for the full Q-function. The theory establishes the existence of an optimal mediated policy and gives convergence and regret bounds for the proposed methods under mixing and coverage assumptions. Empirically, PESCAL outperforms standard offline RL baselines on synthetic confounded data and yields improvements on real ride-hailing offline data.
Abstract
In real-world scenarios, datasets collected from randomized experiments are often limited in size by time and budget constraints. As a result, leveraging large observational datasets becomes a more attractive option for achieving high-quality policy learning. However, most existing offline reinforcement learning (RL) methods depend on two key assumptions, unconfoundedness and positivity, which frequently do not hold in observational data. To address these challenges, we propose a novel policy learning algorithm, PESsimistic CAusal Learning (PESCAL). We use a mediator variable, based on the front-door criterion, to remove the confounding bias; additionally, we adopt the pessimistic principle to address the distributional shift between the action distributions induced by candidate policies and the behavior policy that generates the observational data. Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function, to partially mitigate the issue of distributional shift. This insight significantly simplifies our algorithm by circumventing the challenging task of sequential uncertainty quantification for the estimated Q-function. Moreover, we provide theoretical guarantees for the algorithms we propose and demonstrate their efficacy through simulations, as well as real-world experiments using offline datasets from a leading ride-hailing platform.
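To make the key observation in the abstract concrete, the following is a minimal sketch of front-door adjustment with a pessimistic mediator distribution. It is not the PESCAL algorithm: it assumes a single-step, tabular setting with discrete actions and mediators and nonnegative rewards, and it omits states and the sequential Q-function entirely. The function name `front_door_values`, the constant `c`, and the `c / sqrt(n_a)` width are hypothetical choices made for illustration. The sketch uses the standard front-door formula E[R | do(a)] = sum_m p(m | a) sum_a' p(a') E[R | a', m] and applies the penalty only to the estimate of p(m | a).

```python
import numpy as np

def front_door_values(a, m, r, n_actions, n_mediators, c=0.0):
    """Estimate E[R | do(A=a)] for every action via the front-door formula.

    `c` is a hypothetical pessimism constant: each estimated p(m | a) is
    reduced by c / sqrt(n_a) and clipped at zero. With nonnegative rewards
    this can only lower the estimated value of rarely observed actions.
    """
    a, m = np.asarray(a), np.asarray(m)
    r = np.asarray(r, dtype=float)
    n = len(a)

    # Marginal action distribution p(a') under the behavior policy.
    p_a = np.bincount(a, minlength=n_actions) / n

    # Cell counts and reward sums for each (action, mediator) pair.
    counts = np.zeros((n_actions, n_mediators))
    sums = np.zeros((n_actions, n_mediators))
    np.add.at(counts, (a, m), 1.0)
    np.add.at(sums, (a, m), r)

    # Empirical p(m | a) and E[R | a', m]; unobserved cells default to 0,
    # which is conservative only under the nonnegative-reward assumption.
    n_a = counts.sum(axis=1, keepdims=True)
    p_m_given_a = np.divide(counts, n_a, out=np.zeros_like(counts), where=n_a > 0)
    r_mean = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

    # inner[m] = sum_a' p(a') * E[R | a', m]
    inner = p_a @ r_mean

    # Pessimistic lower bound on the mediator distribution (assumed width).
    width = c / np.sqrt(np.maximum(n_a, 1.0))
    p_lower = np.clip(p_m_given_a - width, 0.0, None)

    return p_lower @ inner

# Toy data in which the reward depends on the action only through the mediator.
a = [0, 0, 0, 0, 1, 1, 1, 1]
m = [0, 0, 0, 1, 0, 1, 1, 1]
r = [1, 1, 1, 2, 1, 2, 2, 2]
plain = front_door_values(a, m, r, n_actions=2, n_mediators=2)
pessimistic = front_door_values(a, m, r, n_actions=2, n_mediators=2, c=0.5)
```

With `c=0` the function returns the plain front-door estimate. Increasing `c` shrinks every mediator probability, so actions with few observations are penalized more, without building any confidence set for the reward or Q-function estimates.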
