A Unifying Framework for Causal Imitation Learning with Hidden Confounders

Daqian Shao, Thomas Kleine Buening, Marta Kwiatkowska

TL;DR

This paper proposes DML-IL, a novel algorithm that uses instrumental variable regression to solve a set of Conditional Moment Restrictions and learn a policy, and provides a bound on the imitation gap for DML-IL.

Abstract

We propose a general and unifying framework for causal Imitation Learning (IL) with hidden confounders that subsumes several existing confounded IL settings from the literature. Our framework accounts for two types of hidden confounders: (a) those observed by the expert, which thus influence the expert's policy, and (b) confounding noise hidden to both the expert and the IL algorithm. For additional flexibility, we also introduce a confounding noise horizon and time-varying expert-observable hidden variables. We show that causal IL in our framework can be reduced to a set of Conditional Moment Restrictions (CMRs) by leveraging trajectory histories as instruments to learn a history-dependent policy. We propose DML-IL, a novel algorithm that uses instrumental variable regression to solve these CMRs and learn a policy. We provide a bound on the imitation gap for DML-IL, which recovers prior results as special cases. Empirical evaluation on a toy environment with continuous state-action spaces and multiple Mujoco tasks demonstrates that DML-IL outperforms state-of-the-art causal IL algorithms.
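To make the idea of instrumental variable regression concrete, the sketch below contrasts ordinary least squares with two-stage least squares on a synthetic confounded dataset. It is not DML-IL and not the paper's environment: the linear "expert" with slope 2.0, the Gaussian noise model, and all function names are assumptions chosen for illustration. The variable `z` stands in for an instrument (such as past trajectory history), `x` for the observed input, `u` for confounding noise that affects both the input and the action, and `y` for the expert action.

```python
import numpy as np


def simulate(n=20000, seed=0):
    """Hypothetical confounded data: noise u affects both input x and action y."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                      # instrument, independent of u
    u = rng.normal(size=n)                      # hidden confounding noise
    x = z + u + 0.1 * rng.normal(size=n)        # observed input, correlated with u
    y = 2.0 * x + u                             # assumed expert slope is 2.0
    return z, x, y


def ols_slope(x, y):
    """Plain regression of y on x; biased here because x correlates with u."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


def two_stage_least_squares(z, x, y):
    """Linear IV regression: project x onto z, then regress y on the projection."""
    z_design = np.column_stack([np.ones_like(z), z])
    stage1, *_ = np.linalg.lstsq(z_design, x, rcond=None)
    x_hat = z_design @ stage1                   # part of x explained by z only
    x_design = np.column_stack([np.ones_like(x_hat), x_hat])
    stage2, *_ = np.linalg.lstsq(x_design, y, rcond=None)
    return stage2[1]


def compare_estimators(n=20000, seed=0):
    """Return (OLS slope, IV slope) on the synthetic data."""
    z, x, y = simulate(n=n, seed=seed)
    return ols_slope(x, y), two_stage_least_squares(z, x, y)


if __name__ == "__main__":
    ols, iv = compare_estimators()
    print(f"OLS slope: {ols:.3f}  IV slope: {iv:.3f}  (assumed true slope: 2.0)")
```

In this toy setup the plain regression absorbs the confounding noise into its slope estimate, while the two-stage estimate uses only the variation in `x` that is explained by the instrument. This is the linear special case of solving the moment restriction $\mathbb{E}[y - f(x) \mid z] = 0$.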

Paper Structure

This paper contains 43 sections, 4 theorems, 40 equations, 5 figures, 1 table, 2 algorithms.

Key Result

Proposition 4.3

The ill-posedness $\nu(\Pi,k)$ increases monotonically with the confounded horizon $k$.
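A small numerical illustration can give intuition for this statement. The sketch below is a toy under stated assumptions, not the paper's definition or proof: it takes the ratio form of ill-posedness common in the instrumental variable literature (the norm of a function divided by the norm of its conditional expectation given the instrument), restricts it to the linear function $f(x) = x$, and uses a stationary Gaussian AR(1) series in which the instrument is the value $k$ steps in the past. The process, the coefficient `rho`, and the function name are all hypothetical.

```python
import numpy as np


def linear_ill_posedness(rho, k, n=50000, seed=0):
    """Estimate ||x_t|| / ||E[x_t | x_{t-k}]|| on an assumed AR(1) series (k >= 1)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=n + k)
    x = np.empty(n + k)
    # Start from the stationary distribution, variance 1 / (1 - rho^2).
    x[0] = noise[0] / np.sqrt(1.0 - rho ** 2)
    for t in range(1, n + k):
        x[t] = rho * x[t - 1] + noise[t]
    current, lagged = x[k:], x[:-k]
    # For a Gaussian AR(1), the conditional mean is linear in the lagged value,
    # so a least-squares slope recovers E[x_t | x_{t-k}].
    slope = np.dot(lagged, current) / np.dot(lagged, lagged)
    projected = slope * lagged
    return float(np.sqrt(np.mean(current ** 2) / np.mean(projected ** 2)))


if __name__ == "__main__":
    for k in range(1, 5):
        print(k, round(linear_ill_posedness(rho=0.8, k=k), 3))
```

In this toy the ratio is approximately $\rho^{-k}$, so it grows as the instrument moves further into the past: an older history carries less information about the current input, which makes the inverse problem harder.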

Figures (5)

  • Figure 1: A causal graph of MDPs with hidden confounders, where at each time step the hidden confounder is $u_t=(u^o_t,u^\varepsilon_t)$. The black dotted lines represent the causal effect of the expert-observable confounder $u^o_t$, which directly affects $a_t$ because the expert policy can observe $u^o_t$. It also directly affects $s_{t+1}$ and $r_t$ because otherwise it is irrelevant to the expected return and there is no reason for the expert to consider it. The red dotted lines represent the causal effect of $u^\varepsilon_t$ that is not observable by the expert, which acts as confounding noise and directly affects the states and actions. $u^\varepsilon_t$ does not directly affect $r_t$ (following Swamy2022_temporal) because the expert policy does not take $u^\varepsilon_t$ into account, and letting $u^\varepsilon_t$ directly affect $r_t$ would only add noise to the expected return.
  • Figure 2: The MSE between the learnt policy and the expert, and the average reward, in the plane ticket environment (Example \ref{eg:plane}).
  • Figure 3: The MSE between the learnt policy and expert, and the average reward, in Mujoco environments.
  • Figure 4: Additional results for the MSE between the learnt policy and the expert, and the average reward, in the plane ticket environment (Example \ref{eg:plane}), with DFIV and DeepGMM as the CMR solvers.
  • Figure 5: Additional results for the MSE between the learnt policy and the expert, and the average reward, in the Ant Mujoco environment, with DFIV and DeepGMM as the CMR solvers.

Theorems & Definitions (13)

  • Example 3.1
  • Remark 4.1
  • Definition 4.2: The ill-posedness of CMRs (Dikkala2020, Chen2012)
  • Proposition 4.3
  • Definition 4.4: c-total variation stability (Bassily2021, Swamy2022_temporal)
  • Theorem 4.5: Imitation Gap Bound
  • Corollary 4.6
  • Corollary 4.7
  • Remark 4.8
  • proof
  • ...and 3 more