Two-way Deconfounder for Off-policy Evaluation in Causal Reinforcement Learning

Shuguang Yu; Shuxing Fang; Ruixin Peng; Zhengling Qi; Fan Zhou; Chengchun Shi

TL;DR

This work tackles off-policy evaluation (OPE) under unmeasured confounding in causal reinforcement learning. It introduces a two-way unmeasured confounding (TWUC) assumption that partitions the latent confounders into trajectory-specific, time-invariant factors $U_i$ and time-specific, trajectory-invariant factors $W_t$, with $Z_{i,t}=(U_i^\top, W_t^\top)^\top$. A two-way deconfounder (TWD) uses a neural tensor network to jointly learn the latent factors and the environment dynamics. Because the confounders are policy-agnostic, the learned confounders can be plugged directly into a model-based OPE estimator. The theoretical guarantees give a finite-sample bound on the estimator's error, showing linear-in-$T$ estimation error and vanishing standard deviations under mild autocorrelation. Empirically, TWD outperforms baselines on simulated tasks and on the real-world MIMIC-III dataset, delivering accurate policy evaluation without external proxies and remaining robust to partial violations of TWUC. The approach advances practical OPE in high-stakes settings by enabling consistent value estimation under flexible latent confounding structures.
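The two-way structure and the plug-in idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `two_way_confounders` and `plug_in_value`, the discount argument `gamma`, and the callables `policy`, `transition`, and `reward` are all hypothetical stand-ins (in the paper the dynamics are learned by a neural tensor network, which is not reproduced here).

```python
import numpy as np


def two_way_confounders(U, W):
    """Build Z with Z[i, t] = concat(U[i], W[t]).

    U: (N, d_u) array of trajectory-specific, time-invariant factors.
    W: (T, d_w) array of time-specific, trajectory-invariant factors.
    Returns an (N, T, d_u + d_w) array.
    """
    N, T = U.shape[0], W.shape[0]
    U_rep = np.broadcast_to(U[:, None, :], (N, T, U.shape[1]))
    W_rep = np.broadcast_to(W[None, :, :], (N, T, W.shape[1]))
    return np.concatenate([U_rep, W_rep], axis=-1)


def plug_in_value(S0, Z, policy, transition, reward, gamma=1.0):
    """Model-based plug-in estimate of a target policy's value.

    Rolls every trajectory forward under `policy`, feeding the (learned)
    confounders Z[:, t] into the (learned) transition and reward models,
    then averages the cumulative reward over trajectories.
    All three callables are hypothetical placeholders for fitted models.
    """
    N, T, _ = Z.shape
    S = np.array(S0, dtype=float)
    total = np.zeros(N)
    for t in range(T):
        A = policy(S)                          # actions chosen by the target policy
        total += (gamma ** t) * reward(S, A, Z[:, t])
        S = transition(S, A, Z[:, t])          # simulated next states
    return total.mean()
```

Because $Z_{i,t}$ is assumed not to depend on the policy, the same array `Z` estimated from the behavior data is reused unchanged when simulating the target policy; only the actions fed to the models change.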

Abstract

This paper studies off-policy evaluation (OPE) in the presence of unmeasured confounders. Inspired by the two-way fixed effects regression model widely used in the panel data literature, we propose a two-way unmeasured confounding assumption to model the system dynamics in causal reinforcement learning. We then develop a two-way deconfounder algorithm that uses a neural tensor network to learn the unmeasured confounders and the system dynamics simultaneously. Based on the learned model, we construct a model-based estimator for consistent policy value estimation. We illustrate the effectiveness of the proposed estimator through theoretical results and numerical experiments.
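For readers unfamiliar with the panel-data model that the abstract cites as inspiration, the classical two-way fixed effects regression $Y_{i,t} = \alpha_i + \lambda_t + X_{i,t}^\top \beta + \varepsilon_{i,t}$ can be fitted on a balanced panel by double demeaning. The sketch below shows that textbook estimator only; it is not the two-way deconfounder, and the function name `twfe_within_estimate` is a hypothetical label chosen for this example.

```python
import numpy as np


def twfe_within_estimate(Y, X):
    """Two-way fixed effects (within) estimator for a balanced panel.

    Y: (N, T) outcomes, X: (N, T, p) covariates.
    Subtracting unit means and time means (and adding back the grand mean)
    removes the additive unit effect alpha_i and time effect lambda_t,
    so beta can be estimated by ordinary least squares on the residuals.
    """
    def demean(A):
        return (A
                - A.mean(axis=1, keepdims=True)       # unit (trajectory) means
                - A.mean(axis=0, keepdims=True)       # time means
                + A.mean(axis=(0, 1), keepdims=True)) # grand mean

    Yd = demean(Y).reshape(-1)
    Xd = demean(X).reshape(-1, X.shape[-1])
    beta_hat, *_ = np.linalg.lstsq(Xd, Yd, rcond=None)
    return beta_hat
```

The additive unit and time effects here play a role analogous to $U_i$ and $W_t$ in the two-way unmeasured confounding assumption, except that the deconfounder lets them enter the dynamics through a learned, non-additive model.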
