Domain-Invariant Per-Frame Feature Extraction for Cross-Domain Imitation Learning with Visual Observations
Minung Kim, Kawon Lee, Jungmo Kim, Sungho Choi, Seungyul Han
TL;DR
The paper tackles cross-domain imitation learning from high-dimensional visual observations by introducing DIFF-IL, which combines domain-invariant per-frame feature extraction with frame-wise time labeling and adversarial sequence-level alignment. The method uses a shared encoder, domain-specific decoders, and Wasserstein GANs to remove domain-specific artifacts while preserving task-relevant cues, and it uses frame and sequence labels to shape rewards. Experiments on Pendulum and MuJoCo tasks show that DIFF-IL achieves better domain transfer, faster convergence, and more robust imitation, and ablations confirm the importance of frame-level labeling and of balancing per-frame and sequence-level mapping. The approach makes vision-based cross-domain imitation more reliable, with implications for sim-to-real transfer and for robust autonomous control in visually diverse settings.
Abstract
Imitation learning (IL) enables agents to mimic expert behavior without reward signals but faces challenges in cross-domain scenarios with high-dimensional, noisy, and incomplete visual observations. To address this, we propose Domain-Invariant Per-Frame Feature Extraction for Imitation Learning (DIFF-IL), a novel IL method that extracts domain-invariant features from individual frames and adapts them into sequences to isolate and replicate expert behaviors. We also introduce a frame-wise time labeling technique to segment expert behaviors by timesteps and assign rewards aligned with temporal contexts, enhancing task performance. Experiments across diverse visual environments demonstrate the effectiveness of DIFF-IL in addressing complex visual tasks.
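To make the idea of frame-wise time labeling concrete, the following is a minimal sketch in Python and not the authors' implementation. It assumes that each frame of an expert trajectory is labeled with its normalized timestep, and that a reward can be formed by scaling an expert-likeness score by a predicted time label. The function names `frame_time_labels` and `time_weighted_reward`, the normalization to [0, 1], and the multiplicative reward form are all hypothetical choices made for illustration.

```python
def frame_time_labels(num_frames):
    """Label frame t of a trajectory with its normalized time t / (T - 1).

    Assumption: labels are normalized to [0, 1]; the paper's exact
    labeling scheme may differ.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    if num_frames == 1:
        return [0.0]
    return [t / (num_frames - 1) for t in range(num_frames)]


def time_weighted_reward(expert_score, predicted_time):
    """Hypothetical reward: expert-likeness scaled by predicted progress.

    Both inputs are assumed to lie in [0, 1]. This multiplicative form is
    an illustrative assumption, not the reward defined in the paper.
    """
    return expert_score * predicted_time


labels = frame_time_labels(5)
print(labels)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Under these assumptions, a frame that looks expert-like but resembles an early stage of the task earns less reward than an equally expert-like frame near the end of the task, which is one way rewards could be aligned with temporal context.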
