Semi-Supervised Cross-Domain Imitation Learning

Li-Min Chu; Kai-Siang Ma; Ming-Hong Chen; Ping-Chun Hsieh

TL;DR

This work tackles cross-domain imitation learning under limited target-domain supervision by introducing Semi-Supervised CDIL and AdaptDICE, an offline framework that transfers knowledge from a source domain with imperfect demonstrations to a target domain with few expert trajectories. The method combines a cross-domain mapping loss to bridge domain gaps, a hybrid density-ratio for cross-domain policy extraction, and an adaptive weighting β(t) to balance source and target contributions, all trained offline without requiring paired demonstrations. The authors establish convergence guarantees for density-ratio estimation and demonstrate consistent gains over baselines on MuJoCo and Robosuite, achieving stable, data-efficient policy learning with minimal supervision. The approach offers practical benefits for real-world deployment where collecting target-domain expert data is costly or hazardous, enabling robust cross-domain imitation with limited labels and abundant imperfect data.
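The interplay between the hybrid density ratio and the adaptive weight β(t) can be illustrated with a small sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the convex-combination form of the hybrid ratio, the linear decay of `beta_schedule`, the direction of that decay (source weight shrinking over training), and the function names are all hypothetical. The names `w_src` and `w_tar` follow the notation used in the ablation figure caption.

```python
import numpy as np

def beta_schedule(t, total_steps):
    # Hypothetical schedule (assumption): weight on the source-derived ratio
    # decays linearly from 1 to 0 over training.
    return max(0.0, 1.0 - t / total_steps)

def hybrid_ratio(w_src, w_tar, beta):
    # Assumed form: convex combination of source- and target-derived
    # density ratios, evaluated per (state, action) sample.
    return beta * w_src + (1.0 - beta) * w_tar

def weighted_bc_loss(log_probs, weights):
    # Weighted behavior cloning: negative log-likelihood of dataset actions,
    # weighted by self-normalized density ratios.
    weights = weights / weights.sum()
    return -np.sum(weights * log_probs)

# Toy usage with made-up numbers.
w_src = np.array([2.0, 0.0])       # ratios estimated with source knowledge
w_tar = np.array([0.0, 2.0])       # ratios estimated from target data only
log_probs = np.array([-1.0, -3.0]) # policy log-likelihoods of dataset actions

beta = beta_schedule(t=50, total_steps=100)
w = hybrid_ratio(w_src, w_tar, beta)
loss = weighted_bc_loss(log_probs, w)
```

Under these assumptions, early training leans on the source-derived ratio, and the target-derived ratio takes over as β(t) shrinks. The paper's actual choice of β(t) is derived from its convergence analysis and may differ from this linear schedule.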

Abstract

Cross-domain imitation learning (CDIL) accelerates policy learning by transferring expert knowledge across domains, which is valuable in applications where collecting expert data is costly. Existing methods are either supervised, relying on proxy tasks and explicit alignment, or unsupervised, aligning distributions without paired data but often suffering from instability. We introduce the Semi-Supervised CDIL (SS-CDIL) setting and propose the first algorithm for SS-CDIL with theoretical justification. Our method uses only offline data, including a small number of target expert demonstrations and some unlabeled imperfect trajectories. To handle domain discrepancy, we propose a novel cross-domain loss function for learning inter-domain state-action mappings and design an adaptive weight function to balance the source and target knowledge. Experiments on MuJoCo and Robosuite show consistent gains over the baselines, demonstrating that our approach achieves stable and data-efficient policy learning with minimal supervision. Our code is available at https://github.com/NYCU-RL-Bandits-Lab/CDIL.

Paper Structure (37 sections, 4 theorems, 66 equations, 14 figures, 11 tables, 1 algorithm)

Introduction
Preliminaries
Cross-Domain Imitation Learning
Regularized Distribution Matching
Methodology
Semi-Supervised CDIL
Proposed Algorithm
Cross-Domain Mapping Loss.
DICE Loss.
Cross-Domain Policy Extraction.
Pseudo-Reward Computation.
Convergence Analysis
Design Choice of Weighting Factor $\beta(t)$
Practical Implementation
Experiments
...and 22 more sections

Key Result

Theorem 1

[Upper Bound of Cross-Domain Density Ratio Error Under AdaptDICE] Under AdaptDICE, with learning rate $\eta \leq 1/L_f$, for each $(s,a)$, the cross-domain density ratio error is bounded as follows: where $L_f$ is the smoothness constant of $L_{\text{DICE}}$, $\nu^*_{\text{tar}}:=\Pi_{S^*_{\text{tar}}}(\nu^{(0)}_{\text{tar}})$, and $C_{w}$ is a constant. Moreover, by selecting $\beta(t)$ as the

Figures (14)

Figure 1: An illustration of CDIL formulations: (a) CDIL with proxy tasks utilizes annotated target demonstrations through paired or unpaired proxy tasks, trading off accuracy and data efficiency. (b) Unsupervised CDIL relies only on source experts and unlabeled target data and typically requires assumptions on domain similarity (e.g., isomorphism) and can suffer from ineffective transfer. (c) Semi-supervised CDIL combines limited labeled target data with unlabeled trajectories, balancing supervision cost and transferability.
Figure 2: Training curves of AdaptDICE and the baseline methods in the Default setting: (a)-(c) MuJoCo locomotion tasks; (d)-(f) Robot arm manipulation tasks in Robosuite.
Figure 3: Ablation study: Training curves of the full AdaptDICE and its two ablation variants (i.e., using only $w_{\text{src}}$ or $w_{\text{tar}}$).
Figure 4: Effect of dataset configurations under AdaptDICE: We compare performance across the Expert Rich and Sub-Optimal Rich regimes, showing that both types of data expansion lead to improvements over the Default dataset setting.
Figure 5: Effect of dataset configurations under SMODICE.
...and 9 more figures

Theorems & Definitions (7)

Theorem 1
Lemma 1: Properties of the DemoDICE Loss
proof
Lemma 2: Convergence of DemoDICE
proof
Theorem 1
proof
