Visually Robust Adversarial Imitation Learning from Videos with Contrastive Learning
Vittorio Giammarino, James Queeney, Ioannis Ch. Paschalidis
TL;DR
This paper tackles Visual Imitation from Observations under visual mismatch between expert and agent environments by introducing C-LAIfO, an efficient end-to-end method that learns a domain-invariant latent representation through data augmentation and contrastive learning. Imitation is conducted in the latent space with off-policy adversarial learning, using two replay buffers and a discriminator to infer a reward signal and train a policy. The authors provide extensive ablations and demonstrate superior performance over baselines on mismatched visual tasks and on challenging Adroit dexterous manipulation with sparse rewards, highlighting robustness to lighting and background changes. The work also emphasizes the importance of carefully designed augmentations and latent-space training, and it releases open-source code to support reproducibility and further development.
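To make the reward-inference step concrete, the following is a minimal sketch of a discriminator over latent transitions. It is not the authors' implementation. The linear discriminator, the names `discriminator_logits`, `discriminator_loss`, and `imitation_reward`, and the reward form r = -log(1 - D) are all assumptions chosen for illustration; the summary above does not specify them.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def discriminator_logits(z, z_next, w, b):
    # Hypothetical linear discriminator on concatenated latent transitions (z, z').
    # z, z_next: arrays of shape (batch, latent_dim); w: shape (2 * latent_dim,).
    pairs = np.concatenate([z, z_next], axis=-1)
    return pairs @ w + b


def discriminator_loss(expert_logits, agent_logits):
    # Binary cross-entropy with expert transitions labeled 1 and agent transitions 0.
    # In an off-policy setup the two batches would be sampled from the
    # expert and agent replay buffers, respectively.
    eps = 1e-8
    expert_term = -np.log(sigmoid(expert_logits) + eps).mean()
    agent_term = -np.log(1.0 - sigmoid(agent_logits) + eps).mean()
    return expert_term + agent_term


def imitation_reward(logits):
    # Assumed reward r = -log(1 - D(z, z')), which equals softplus(logit).
    # np.logaddexp(0, x) computes log(1 + exp(x)) in a numerically stable way.
    return np.logaddexp(0.0, logits)
```

Under these assumptions, transitions that the discriminator scores as more expert-like receive a larger reward, which a standard off-policy reinforcement learning algorithm could then maximize.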
Abstract
We propose C-LAIfO, a computationally efficient algorithm designed for imitation learning from videos in the presence of visual mismatch between agent and expert domains. We analyze the problem of imitation from expert videos with visual discrepancies, and introduce a solution for robust latent space estimation using contrastive learning and data augmentation. Given a visually robust latent space, our algorithm performs imitation entirely within this space using off-policy adversarial imitation learning. We conduct a thorough ablation study to justify our design and test C-LAIfO on high-dimensional continuous robotic tasks. Additionally, we demonstrate how C-LAIfO can be combined with other reward signals to facilitate learning on a set of challenging hand manipulation tasks with sparse rewards. Our experiments show improved performance compared to baseline methods, highlighting the effectiveness of C-LAIfO. To ensure reproducibility, we open-source our code.
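As an illustration of how contrastive learning with data augmentation can encourage a domain-invariant latent space, the sketch below implements an InfoNCE-style loss. This is an assumption made for illustration: the abstract does not state which contrastive objective C-LAIfO uses, and the function name `info_nce_loss` and the `temperature` parameter are hypothetical. Each row of `anchors` and the matching row of `positives` are taken to be embeddings of two augmented views of the same observation; all other rows in the batch act as negatives.

```python
import numpy as np


def info_nce_loss(anchors, positives, temperature=0.1):
    # anchors, positives: arrays of shape (batch, latent_dim).
    # Row i of `positives` is the positive for row i of `anchors`.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    # Cosine similarity between every anchor and every candidate, scaled by temperature.
    logits = (a @ p.T) / temperature

    # Row-wise log-softmax, shifted by the row maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy where the correct "class" for anchor i is candidate i.
    return -np.mean(np.diag(log_probs))
```

Minimizing a loss of this kind pulls embeddings of differently augmented views of the same frame together and pushes embeddings of other frames apart, so appearance changes introduced by the augmentations (for example, brightness or background perturbations) have less influence on the latent representation.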
