Two-way Deconfounder for Off-policy Evaluation in Causal Reinforcement Learning

Shuguang Yu, Shuxing Fang, Ruixin Peng, Zhengling Qi, Fan Zhou, Chengchun Shi

TL;DR

This work tackles off-policy evaluation (OPE) under unmeasured confounding in causal reinforcement learning. It introduces a two-way unmeasured confounding (TWUC) assumption that splits the latent confounders into trajectory-specific, time-invariant factors $U_i$ and time-specific, trajectory-invariant factors $W_t$, with $Z_{i,t}=(U_i^\top, W_t^\top)^\top$. A two-way deconfounder (TWD) uses a neural tensor network to jointly learn the latent factors and the environment dynamics. Because the confounders are policy-agnostic, the learned confounders can be plugged directly into a model-based OPE estimator. The theoretical guarantees give a finite-sample bound on the estimator error, with an estimation error that scales linearly in $T$ and standard deviations that vanish under mild autocorrelation. Empirically, TWD outperforms baselines on simulated tasks and on the real-world MIMIC-III dataset, evaluating policies accurately without external proxies and remaining robust to partial violations of TWUC. The approach advances practical OPE in high-stakes settings by enabling consistent value estimation under flexible latent confounding structures.
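
As a minimal sketch of the two-way structure (not the authors' code), the snippet below builds $Z_{i,t}=(U_i^\top, W_t^\top)^\top$ by concatenating a per-trajectory vector with a per-time-step vector. The function name `two_way_confounders`, the toy sizes, and the randomly drawn placeholder factors are all assumptions made for illustration; in the paper the factors are learned, not sampled.

```python
import random

def two_way_confounders(U, W):
    """Return Z with Z[i][t] equal to U[i] concatenated with W[t]."""
    return [[list(u_i) + list(w_t) for w_t in W] for u_i in U]

# Hypothetical toy sizes: 3 trajectories, 4 time steps, dim(U_i)=2, dim(W_t)=1.
random.seed(0)
N, T, d_u, d_w = 3, 4, 2, 1
U = [[random.gauss(0, 1) for _ in range(d_u)] for _ in range(N)]  # placeholder U_i
W = [[random.gauss(0, 1) for _ in range(d_w)] for _ in range(T)]  # placeholder W_t
Z = two_way_confounders(U, W)
```

The first `d_u` entries of `Z[i][t]` are the same for every `t` within trajectory `i`, and the remaining entries are the same for every `i` at time `t`, which is exactly the restriction the TWUC assumption places on the confounders.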

Abstract

This paper studies off-policy evaluation (OPE) in the presence of unmeasured confounders. Inspired by the two-way fixed effects regression model widely used in the panel data literature, we propose a two-way unmeasured confounding assumption to model the system dynamics in causal reinforcement learning and develop a two-way deconfounder algorithm that devises a neural tensor network to simultaneously learn both the unmeasured confounders and the system dynamics, based on which a model-based estimator can be constructed for consistent policy value estimation. We illustrate the effectiveness of the proposed estimator through theoretical results and numerical experiments.
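
The model-based estimator described above can be illustrated with a small plug-in rollout. This is a sketch under stated assumptions rather than the paper's implementation: `plug_in_value`, `toy_model`, and `always_one` are hypothetical names, the "learned" model is a hand-written deterministic function with scalar observations and confounders, and a real fitted model would typically be stochastic and rolled out by Monte Carlo.

```python
def plug_in_value(model, policy, init_obs, U, W, gamma=1.0):
    """Average (discounted) return of `policy` under rollouts of `model`,
    plugging in the confounders u_i and w_t at each (i, t)."""
    total = 0.0
    for o0, u_i in zip(init_obs, U):
        obs, ret = o0, 0.0
        for t, w_t in enumerate(W):
            action = policy(obs)                        # target policy acts
            reward, obs = model(obs, action, u_i, w_t)  # model predicts outcome
            ret += (gamma ** t) * reward
        total += ret
    return total / len(init_obs)

# Hypothetical stand-in for a fitted model (assumption, for illustration only).
def toy_model(obs, action, u, w):
    reward = obs + action + u + w
    next_obs = 0.5 * obs + action
    return reward, next_obs

# Hypothetical target policy that always takes action 1.
def always_one(obs):
    return 1

value = plug_in_value(toy_model, always_one,
                      init_obs=[0.0, 2.0], U=[1.0, -1.0], W=[0.0, 0.5])
```

The confounders enter only as inputs to the model, never to the policy, which reflects the policy-agnostic property that makes the plug-in step possible.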

Paper Structure

This paper contains 29 sections, 36 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: The directed acyclic graphs of the data-generating processes under different assumptions. (a): $\{z_{i,t}\}_{i,t}$ (colored in blue) are unconstrained unmeasured confounders. (b): $\{h_i\}_i$ (colored in green) are one-way unmeasured confounders. (c): $\{u_i\}_i$ and $\{w_t\}_t$ (colored in orange) are two-way unmeasured confounders.
  • Figure 2: (a): An overview of the proposed network architecture. (b): The upper panel reports MSEs for fitting the observed data under different unmeasured confounding assumptions, whereas the bottom panel displays the MSEs for off-policy value prediction. The unconstrained unmeasured confounding model fits the training data best because it overfits, while the OPE estimator under the proposed two-way unmeasured confounding achieves the smallest MSE. See the appendix on the linear simulation setting for more details.
  • Figure 3: Logarithmic MSE and Bias of various estimators for the simulated dynamic process and tumor growth example.
  • Figure 4: (a): The estimated policy values for four target policies on the real-world dataset. (b): Average root MSE and its standard error for predicting the immediate reward and the next observation. The results are aggregated over 20 runs.
  • Figure 5: Sensitivity analysis for the simulated dynamic process and tumor growth experiment.
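
One hedged way to read the overfitting remark in the Figure 2 caption is to count how many latent values each assumption in Figure 1 asks the model to estimate. The sketch below does this for a hypothetical panel; the function name `latent_counts`, the sizes, and the simplification that every latent vector shares one dimension `dim` are assumptions, not details from the paper.

```python
def latent_counts(n_traj, horizon, dim):
    """Number of latent scalars under each confounding assumption."""
    return {
        "unconstrained": n_traj * horizon * dim,  # one z_{i,t} per (i, t) cell
        "one_way": n_traj * dim,                  # one h_i per trajectory
        "two_way": (n_traj + horizon) * dim,      # one u_i per trajectory, one w_t per step
    }

# Hypothetical sizes: 100 trajectories, 50 time steps, 4-dimensional latents.
counts = latent_counts(n_traj=100, horizon=50, dim=4)
```

With these illustrative sizes the unconstrained model carries 20,000 latent values, against 400 for the one-way model and 600 for the two-way model. The two-way structure adds time-varying flexibility over the one-way model at a small cost in extra latents.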
