Instrumental Variable Value Iteration for Causal Offline Reinforcement Learning

Luofeng Liao, Zuyue Fu, Zhuoran Yang, Yixin Wang, Mladen Kolar, Zhaoran Wang

TL;DR

We propose IV-aided Value Iteration (IVVI), a provably efficient algorithm based on a primal-dual reformulation of the conditional moment restriction. To our knowledge, it is the first provably efficient algorithm for instrument-aided offline RL.

Abstract

In offline reinforcement learning (RL), an optimal policy is learned solely from observational data collected a priori. However, in observational data, actions are often confounded by unobserved variables. Instrumental variables (IVs), in the context of RL, are variables whose influence on the state variables is entirely mediated by the action. When a valid instrument is present, we can recover the confounded transition dynamics from observational data. We study a confounded Markov decision process where the transition dynamics admit an additive nonlinear functional form. Using IVs, we derive a conditional moment restriction through which we can identify the transition dynamics from observational data. We propose a provably efficient IV-aided Value Iteration (IVVI) algorithm based on a primal-dual reformulation of the conditional moment restriction. To our knowledge, this is the first provably efficient algorithm for instrument-aided offline RL.
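To make the role of the instrument concrete, here is a minimal sketch in Python of a one-dimensional, linear toy problem. It is not the IVVI algorithm and does not use the paper's nonlinear model or its primal-dual formulation; the function name `simulate_estimates`, the coefficients, and the noise scales are all illustrative assumptions. The sketch only shows why regressing the next state on the action is biased when an unobserved variable drives both, and how an unconditional moment condition built from the instrument removes that bias.

```python
import numpy as np


def simulate_estimates(n=20000, seed=0, true_effect=2.0):
    """Toy confounded transition x' = true_effect * a + noise (all values assumed).

    Returns (ols_estimate, iv_estimate) of the action's effect on the next state.
    """
    rng = np.random.default_rng(seed)

    z = rng.normal(size=n)  # instrument: influences x' only through the action
    u = rng.normal(size=n)  # unobserved confounder: influences both a and x'

    # Behavior action depends on the instrument and on the confounder.
    a = 1.0 * z + 1.0 * u + 0.1 * rng.normal(size=n)

    # Additive-noise transition; the noise term contains u, so it is
    # correlated with the action but independent of the instrument.
    x_next = true_effect * a + 3.0 * u + 0.1 * rng.normal(size=n)

    # Ordinary least squares (no intercept): biased because cov(a, u) != 0.
    ols = (a @ x_next) / (a @ a)

    # IV estimate from the moment condition E[z * (x' - theta * a)] = 0.
    iv = (z @ x_next) / (z @ a)
    return ols, iv


if __name__ == "__main__":
    ols, iv = simulate_estimates()
    print(f"OLS estimate: {ols:.3f}")
    print(f"IV estimate:  {iv:.3f}")
```

In this toy setup the ordinary regression absorbs the confounder's contribution and overestimates the effect, while the instrument-based estimate stays close to the assumed true value. The paper's setting replaces the single linear moment with a conditional moment restriction over a nonlinear function class.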

Paper Structure

This paper contains 36 sections, 14 theorems, 143 equations, 9 figures, 9 tables, 1 algorithm.

Key Result

Proposition 1

If $(x,a,z,x')$ is distributed according to the law $\bar{d}_{\pi_b}$, then for any $z\in \mathcal{Z}$,

Figures (9)

  • Figure 1: The NICU application, adapted from chen2021estimating. Sufficient covariates have been conditioned on. Top panel: DAG representing data generation process where UCs are present. Bottom panel: DAG representing a prenatal regionalization system in action.
  • Figure 2: Left panel: An illustration of Definition \ref{def:iv} with one UC $\varepsilon$ and three observable variables $X$, $Y$, and $Z$. Right panel: Observation setting of CMDP-IV with a behavior policy $\pi_b$ (left). Evaluation setting of CMDP-IV with intervention induced by $\pi$ (right).
  • Figure 3: Experiment results for the parametric setting. The gradient descent ascent loss $\|W^t - W^*\|_{F}$ for different settings of instrument strength under different dimensions.
  • Figure 4: Experiment results for the parametric setting. Top panel: The performance curves (with 95% confidence interval) of reward versus the time steps for different transition functions (without baseline). Bottom panel: The performance curves (with 95% confidence interval) of reward versus the time steps for different transition functions (including the ordinary regression baseline). The time step is the episode for SPEDE.
  • Figure 5: Robustness check experiment results for the parametric setting. Top panel: The gradient descent ascent loss $\|W^{t}-W^{sad}\|_{F}$ for different settings of instrument strength under different dimensions. Bottom panel: The performance curves (with 95% confidence interval) of reward versus the time steps for different transition functions (without baseline). The time step is the episode for SPEDE.
  • ...and 4 more figures

Theorems & Definitions (27)

  • Example 1: (Recommendation as an IV, MovieLens 1M data)
  • Example 2: (Differential travel time as an IV, NICU data)
  • Example 3: (Preference-based IV, MIMIC-III data)
  • Definition 1: Confounders and Instrumental Variables, pearl2009causality
  • Remark 2: Generalization of Figure \ref{fig:confoundedmcp} (right panel)
  • Remark 3: On additive noise assumption
  • Remark 4: The challenge of UCs
  • Remark 5: Global IVs and global UCs
  • Proposition 1: CMR
  • Proposition 1
  • ...and 17 more