Visual Imitation Learning with Calibrated Contrastive Representation

Yunke Wang, Linwei Tao, Bo Du, Yutian Lin, Chang Xu

TL;DR

Visual imitation learning with high-dimensional visual states benefits from a calibrated contrastive representation. The authors propose Contrastive Adversarial Imitation Learning (CAIL), which augments GAIL with unsupervised and supervised contrastive losses and a calibration mechanism that treats agent demonstrations as a mixture of qualities, enabling end-to-end training without architectural changes. The method shows improved sample efficiency and strong performance on the DMControl Suite, outperforming baselines such as GAIL, PCIL, and PatchAIL while maintaining computational efficiency. This approach enhances representation learning in visual IL, offering a practical path toward robust, data-efficient imitation from visual demonstrations.

Abstract

Adversarial Imitation Learning (AIL) allows the agent to reproduce expert behavior with low-dimensional states and actions. However, challenges arise in handling visual states because their representations are less distinguishable than low-dimensional proprioceptive features. While existing methods resort to adopting complex network architectures or separating representation learning from decision-making, they overlook valuable intra-agent information within demonstrations. To address this problem, this paper proposes a simple and effective solution that incorporates calibrated contrastive representation learning into the visual AIL framework. Specifically, we present an image encoder in visual AIL that uses a combination of unsupervised and supervised contrastive learning to extract valuable features from visual states. Because the improving agent often produces demonstrations of varying quality, we propose to calibrate the contrastive loss by treating each agent demonstration as a mixed sample. The contrastive learning objective can be jointly optimized with the AIL framework, without modifying the architecture or incurring significant computational costs. Experimental results on the DMControl Suite demonstrate that our proposed method is sample efficient and outperforms the compared methods in several respects.
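The two contrastive terms named in the abstract can be illustrated with a minimal, dependency-free sketch. The unsupervised term below is a standard InfoNCE loss between two augmented views. The calibrated supervised term is only one possible reading of "treating each agent demonstration as a mixed sample": each sample carries a soft probability `expert_prob` of being expert-like, and a pair is weighted by the probability that both samples share a class. The function names, the `expert_prob` parameter, the pair-weighting rule, and the temperature value are all assumptions made for illustration, not the formulation used in the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(u):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def info_nce(view_a, view_b, temperature=0.1):
    """Unsupervised contrastive loss: view_a[i] should match view_b[i]
    and differ from view_b[j] for j != i."""
    a = [normalize(u) for u in view_a]
    b = [normalize(u) for u in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        total += log_sum_exp(logits) - logits[i]
    return total / len(a)


def calibrated_supcon(embeddings, expert_prob, temperature=0.1):
    """Supervised contrastive loss with soft labels (hypothetical calibration).

    expert_prob[i] is the assumed probability that sample i is expert-like:
    1.0 for expert data, a value in [0, 1] for agent data. With only 0/1
    values this reduces to an ordinary supervised contrastive loss.
    """
    z = [normalize(u) for u in embeddings]
    n = len(z)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        logits = [dot(z[i], z[j]) / temperature for j in others]
        log_denominator = log_sum_exp(logits)
        # Probability that samples i and j belong to the same class.
        weights = [
            expert_prob[i] * expert_prob[j]
            + (1.0 - expert_prob[i]) * (1.0 - expert_prob[j])
            for j in others
        ]
        weight_sum = sum(weights)
        if weight_sum == 0.0:
            continue  # no positives for this anchor
        total += sum(
            w * (log_denominator - l) for w, l in zip(weights, logits)
        ) / weight_sum
    return total / n


# Toy usage: two "expert" and two "agent" embeddings in 2-D.
features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
hard_loss = calibrated_supcon(features, [1.0, 1.0, 0.0, 0.0])
soft_loss = calibrated_supcon(features, [1.0, 1.0, 0.3, 0.3])
print(hard_loss, soft_loss)
```

In a joint training setup such as the one the abstract describes, terms like these would be added to the discriminator loss and backpropagated through the shared image encoder; the relative weighting of the terms is not specified here.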

Paper Structure (12 sections, 1 theorem, 11 equations, 6 figures, 3 tables, 1 algorithm)

This paper contains 12 sections, 1 theorem, 11 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Rewrite the objective function of CAIL as $\min_{\theta,f}\max_{h_d} \mathcal{J}_{h_d,f}(\theta)$, where $\Psi(f)$ denotes the contrastive constraint. Regardless of the encoder $f$, $\mathcal{L}_{h_d,f}(\theta)$ can converge with respect to $\theta$, and the convergence point is reached when $\rho_{\pi_{\theta^\ast}}=\rho_{\pi_e}$.
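For orientation, here is a hedged sketch of what the adversarial term could look like. It assumes that $\mathcal{L}_{h_d,f}(\theta)$ is the standard GAIL discriminator objective with the discriminator written as a head $h_d$ applied to the encoder output $f(\mathbf{v})$, and that the occupancy measures are taken over visual states only. Neither assumption is stated in the text above, so the expression is illustrative rather than the paper's definition.

$$
\mathcal{L}_{h_d,f}(\theta) \;=\; \mathbb{E}_{\mathbf{v}\sim\rho_{\pi_\theta}}\!\left[\log h_d\big(f(\mathbf{v})\big)\right] \;+\; \mathbb{E}_{\mathbf{v}\sim\rho_{\pi_e}}\!\left[\log\!\big(1-h_d(f(\mathbf{v}))\big)\right]
$$

Under this assumed form, when $\rho_{\pi_\theta}=\rho_{\pi_e}$ the two expectations are taken over the same distribution for any fixed $f$, so no head $h_d$ can separate agent from expert samples. This matches the convergence point named in the theorem.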

Figures (6)

  • Figure 1: Two visual states $\textbf{v}_1, \textbf{v}_2$ and their corresponding physical states $\textbf{s}_1, \textbf{s}_2$ are shown in the figure. The physical state on the right column contains proprioceptive information (i.e., positions and velocities). Although a significant change in the physical state occurs, it may only result in slight changes to the visual state.
  • Figure 2: An overview of Contrastive Adversarial Imitation Learning. The encoder extracts the representation of the augmented expert state, and two augmented agent states. The training objective of the encoder consists of a discrimination loss, an unsupervised contrastive loss and a calibrated supervised contrastive loss.
  • Figure 3: Benchmarking domains. Top: Cartpole, Finger, Hopper. Bottom: Cheetah, Walker, and Quadruped.
  • Figure 4: Learning curves of CAIL and PatchAIL on 5 DMC tasks with respect to training time. Training time is rescaled so that '1' denotes the time CAIL takes to complete 1M steps. Given the same training time, CAIL outperforms PatchAIL.
  • Figure 5: Spatial attention maps of the discriminators at 1M steps. Each map shows the region that the discriminator focuses on when making its decision.
  • ...and 1 more figures

Theorems & Definitions (1)

  • Theorem 1