Visually Robust Adversarial Imitation Learning from Videos with Contrastive Learning

Vittorio Giammarino, James Queeney, Ioannis Ch. Paschalidis

TL;DR

This paper tackles Visual Imitation from Observations under visual mismatch between expert and agent environments by introducing C-LAIfO, an efficient end-to-end method that learns a domain-invariant latent representation through data augmentation and contrastive learning. Imitation is conducted in the latent space with off-policy adversarial learning, using two replay buffers and a discriminator to infer a reward signal and train a policy. The authors provide extensive ablations and demonstrate superior performance over baselines on mismatched visual tasks and on challenging Adroit dexterous manipulation with sparse rewards, highlighting robustness to lighting and background changes. The work also emphasizes the importance of carefully designed augmentations and latent-space training, and it releases open-source code to support reproducibility and further development.
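As a rough illustration of how a discriminator can turn two replay buffers into a reward signal, the sketch below scores latent transitions with a logistic discriminator trained by binary cross-entropy. It is a minimal sketch under stated assumptions, not the paper's implementation: the linear discriminator, the function names, and the reward form `log D - log(1 - D)` are all hypothetical choices made here for illustration.

```python
import math

EPS = 1e-8  # numerical guard for the logarithms


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discriminator(w, b, z, z_next):
    """Hypothetical linear discriminator over a latent transition (z, z_next).

    Returns the estimated probability that the transition came from the expert.
    """
    features = list(z) + list(z_next)
    logit = sum(wi * fi for wi, fi in zip(w, features)) + b
    return sigmoid(logit)


def bce_loss(w, b, expert_batch, agent_batch):
    """Binary cross-entropy: expert transitions are labeled 1, agent transitions 0."""
    loss = 0.0
    for z, z_next in expert_batch:
        loss -= math.log(discriminator(w, b, z, z_next) + EPS)
    for z, z_next in agent_batch:
        loss -= math.log(1.0 - discriminator(w, b, z, z_next) + EPS)
    return loss / (len(expert_batch) + len(agent_batch))


def inferred_reward(w, b, z, z_next):
    """One common adversarial-imitation reward (an assumption here, not the paper's)."""
    d = discriminator(w, b, z, z_next)
    return math.log(d + EPS) - math.log(1.0 - d + EPS)


# Toy data: one batch sampled from each (hypothetical) replay buffer.
expert_batch = [([1.0, 0.0], [1.0, 0.0])]
agent_batch = [([-1.0, 0.0], [-1.0, 0.0])]
w, b = [1.0, 0.0, 1.0, 0.0], 0.0

print(bce_loss(w, b, expert_batch, agent_batch))
print(inferred_reward(w, b, *expert_batch[0]), inferred_reward(w, b, *agent_batch[0]))
```

In this toy setup the policy would be rewarded for producing latent transitions that the discriminator cannot tell apart from the expert's; because the discriminator only sees latents, its usefulness depends on the encoder removing domain-specific visual detail.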

Abstract

We propose C-LAIfO, a computationally efficient algorithm designed for imitation learning from videos in the presence of visual mismatch between agent and expert domains. We analyze the problem of imitation from expert videos with visual discrepancies, and introduce a solution for robust latent space estimation using contrastive learning and data augmentation. Provided a visually robust latent space, our algorithm performs imitation entirely within this space using off-policy adversarial imitation learning. We conduct a thorough ablation study to justify our design and test C-LAIfO on high-dimensional continuous robotic tasks. Additionally, we demonstrate how C-LAIfO can be combined with other reward signals to facilitate learning on a set of challenging hand manipulation tasks with sparse rewards. Our experiments show improved performance compared to baseline methods, highlighting the effectiveness of C-LAIfO. To ensure reproducibility, we open source our code.
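The contrastive step mentioned above can be sketched with a generic InfoNCE-style loss, in which latents of two augmentations of the same frame form a positive pair and the other frames in the batch serve as negatives. This is only an assumed, simplified form: the paper's actual contrastive loss is not reproduced in this summary, and the function names, cosine similarity, and temperature value below are illustrative choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two latent vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (an assumed stand-in for the paper's contrastive loss).

    anchors[i] and positives[i] are latents of two augmentations of the same
    frame; every positives[j] with j != i acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


# Toy latents: matched pairs give a low loss, swapped pairs a high one.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

print(info_nce(anchors, matched))
print(info_nce(anchors, swapped))
```

Minimizing a loss of this kind pushes the encoder to map differently augmented views of the same frame (for example, with altered brightness or background) to nearby latents, which is the sense in which the latent space becomes visually robust.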

Paper Structure

This paper contains 24 sections, 2 theorems, 10 equations, 14 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

Consider source and target POMDPs respectively defined by the tuples $(\mathcal{S}, \mathcal{A}, \mathcal{X}, \mathcal{T}, \mathcal{U}_T, \mathcal{R}, \rho_0, \gamma)$ and $(\mathcal{S}, \mathcal{A}, \mathcal{X}, \mathcal{T}, \mathcal{U}_S, \mathcal{R}, \rho_0, \gamma)$. Let $\mathcal{X} = (\bar{\ma

Figures (14)

  • Figure 1: Robotic manipulation task. Current end-to-end methods for imitation from expert videos assume that the expert and the agent operate in the same environment. Consequently, they are unable to handle variations in lighting or background.
  • Figure 2: Summary of C-LAIfO. In the diagram, black lines indicate shared weights among networks, blue arrows indicate the forward pass through the networks, and red arrows indicate the backward pass. The losses $\mathcal{L}_{D}$, $\mathcal{L}_Q$, and $\mathcal{L}(z_{\bm{\delta}})$ are given in \ref{eq:AIL_BCE}, \ref{eq:Q_regression_regularized}, and \ref{eq:contr_loss}, respectively. $\mathcal{L}_{\pi}$ denotes the deterministic actor-critic loss \cite{silver2014deterministic}.
  • Figure 3: Different environments used for the experiments in Table \ref{table_visual_experiments} and the PCA in Figs. \ref{fig:walker_PCA_light} and \ref{fig:walker_PCA_full}.
  • Figure 4: PCA results for the Light experiment in Table \ref{table_visual_experiments}.
  • Figure 5: PCA results on C-LAIfO for the Full experiment in Table \ref{table_visual_experiments} and the unseen environment in Fig. \ref{fig:walker_unseen}.
  • ...and 9 more figures

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Proposition 2
  • proof