Offline Imitation Learning with Variational Counterfactual Reasoning

Bowei He; Zexu Sun; Jinxin Liu; Shuai Zhang; Xu Chen; Chen Ma

TL;DR

The paper tackles offline imitation learning when expert data is scarce and the rest of the offline dataset is unlabeled and suboptimal. It proposes Offline Imitation Learning with Counterfactual Data Augmentation (OILCA), which uses an identifiable variational autoencoder within a structural causal model (SCM) to generate counterfactual expert data, improving generalization without online interaction. The authors analyze identifiability and generalization theoretically and report that OILCA outperforms various baselines on the DeepMind Control Suite (in-distribution performance) and CausalWorld (out-of-distribution generalization). The code is publicly released.

Abstract

In offline imitation learning (IL), an agent aims to learn an optimal expert behavior policy without additional online environment interactions. However, in many real-world scenarios, such as robotic manipulation, the offline dataset is collected from suboptimal behaviors without rewards. Because expert data is scarce, agents tend to simply memorize poor trajectories and are vulnerable to variations in the environment, lacking the ability to generalize to new environments. To automatically generate high-quality expert data and improve the agent's generalization ability, we propose a framework named Offline Imitation Learning with Counterfactual data Augmentation (OILCA), which performs counterfactual inference. In particular, we leverage an identifiable variational autoencoder to generate counterfactual samples for expert data augmentation. We theoretically analyze the influence of the generated expert data and the improvement in generalization. Moreover, we conduct extensive experiments to demonstrate that our approach significantly outperforms various baselines on both the DeepMind Control Suite benchmark for in-distribution performance and the CausalWorld benchmark for out-of-distribution generalization. Our code is available at https://github.com/ZexuSun/OILCA-NeurIPS23.
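
To make the idea of counterfactual sample generation concrete, the following is a minimal sketch of the textbook three-step counterfactual procedure (abduction, action, prediction) on a toy one-dimensional transition. It is not the authors' implementation: the linear mechanism `s' = 0.9*s + 0.5*a + u`, the `Transition` class, the function names, and the choice to intervene on the action are all assumptions made for illustration. In particular, the exogenous variable is recovered here in closed form, whereas the paper infers it with an identifiable variational autoencoder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    state: float
    action: float
    next_state: float


def mechanism(state: float, action: float, noise: float) -> float:
    """Hypothetical structural equation: s' = 0.9*s + 0.5*a + u."""
    return 0.9 * state + 0.5 * action + noise


def abduct_noise(t: Transition) -> float:
    """Abduction: recover the exogenous noise u implied by the observed transition."""
    return t.next_state - 0.9 * t.state - 0.5 * t.action


def counterfactual(t: Transition, new_action: float) -> Transition:
    """Action and prediction: swap in a new action, keep the abducted noise fixed."""
    u = abduct_noise(t)
    return Transition(t.state, new_action, mechanism(t.state, new_action, u))


# An observed transition generated with (assumed) noise u = 0.25.
observed = Transition(state=1.0, action=2.0, next_state=mechanism(1.0, 2.0, 0.25))

# "What would the next state have been under a different action, all else equal?"
augmented = counterfactual(observed, new_action=-1.0)
```

The counterfactual transition shares the same exogenous noise as the observed one, which is what distinguishes it from simply resampling the dynamics with fresh noise.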

Paper Structure (37 sections, 6 theorems, 24 equations, 10 figures, 7 tables, 1 algorithm)

Introduction
Related Works
Offline IL
Causal Dynamics RL
Preliminaries
Problem Definition
Counterfactual Reasoning
SCM representation of causal MDP
Variational Autoencoder and Identifiability
Offline Imitation Learning with Counterfactual Data Augmentation
Counterfactual Data Augmentation
Offline Agent Imitation Learning
Theoretical Analysis
Experiments
Simulations on Toy Environment (Q1)
...and 22 more sections

Key Result

Theorem 1

Assume that we observe data sampled from a generative model defined according to Equations eq:model through eq:exo_noise and Equation eq:tlambda, with parameters $(\boldsymbol{f},\boldsymbol{T},\boldsymbol{\lambda})$, and that the conditions of the theorem hold. Then the parameters $\boldsymbol{\theta}=(\boldsymbol{f}, \boldsymbol{T}, \boldsymbol{\lambda})$ are identifiable up to an equivalence class induced by permutation and c

Figures (10)

Figure 1: Agent is trained with the collected dataset containing limited expert data and large amounts of unlabeled data, and tested on both in-distribution and out-of-distribution environments.
Figure 2: SCM of causal Markov Decision Process (MDP). We incorporate an exogenous variable in the SCM that is learned and utilized for counterfactual reasoning about do-intervention.
Figure 3: Visualization of both observation and latent spaces of the exogenous variable. (a) Samples from the true distribution of the sources $p_{\boldsymbol{\theta}^*}(u | c)$. (b) Samples from the posterior $q_{\boldsymbol{\phi}}\left(u | s_t, a_t, s_{t+1}, c\right)$. (c) Samples from the posterior $q_{\boldsymbol{\phi'}}\left(u| s_t, a_t, s_{t+1}\right)$ without class label.
Figure 4: Performance of OILCA and baselines in the toy environment. We plot the mean and the standard errors of the averaged return over five random seeds.
Figure 5: Performance of OILCA with the growing percentage of $|\mathcal{D}_{E}| /|\mathcal{D}_{U}|$. We plot the average return's mean and standard errors over five random seeds.
...and 5 more figures

Theorems & Definitions (11)

Definition 1: Structural Causal Model (SCM)
Definition 2: do-intervention in SCM
Theorem 1
Theorem 2
Theorem 3
Theorem 4
proof
Lemma 1
Lemma 2
proof
...and 1 more
