Manipulator-Independent Representations for Visual Imitation
Yuxiang Zhou, Yusuf Aytar, Konstantinos Bousmalis
TL;DR
The paper tackles cross-embodiment visual imitation for robotic manipulation from visual observations alone, without access to the demonstrator's actions. It introduces manipulator-independent representations (MIR), learned with temporally-smooth contrastive networks (TSCN) and cross-domain goal-conditional policies (CD-GCP) and augmented by domain randomization to bridge the sim-to-real gap. Used within an RL-based trajectory-tracking framework, MIR demonstrates superior imitation performance on unseen embodiments and challenging object interactions, including real-world scenarios. The work provides a dataset, a training regime, and empirical evidence that a perception module that focuses on environmental changes and temporal continuity, while remaining actionable for RL, enables high-fidelity imitation across diverse morphologies. This advances the practical applicability of third-person, cross-embodiment imitation in robotics.
Abstract
Imitation learning is an effective tool for robotic learning tasks where specifying a reinforcement learning (RL) reward is not feasible or where the exploration problem is particularly difficult. Imitation methods, typically behavior cloning or inverse RL, derive a policy from a collection of first-person action-state trajectories. This is contrary to how humans and other animals imitate: we observe a behavior, even from other species, understand its perceived effect on the state of the environment, and figure out what actions our body can perform to reach a similar outcome. In this work, we explore the possibility of third-person visual imitation of manipulation trajectories, only from vision and without access to actions, demonstrated by embodiments different from that of our imitating agent. Specifically, we investigate what would be an appropriate representation method with which an RL agent can visually track trajectories of complex manipulation behavior -- non-planar with multiple-object interactions -- demonstrated by experts with different embodiments. We present a way to train manipulator-independent representations (MIR) that primarily focus on the change in the environment and have all the characteristics that make them suitable for cross-embodiment visual imitation with RL: cross-domain alignment, temporal smoothness, and being actionable. We show that with our proposed method our agents are able to imitate, with complex robot control, trajectories from a variety of embodiments and with significant visual and dynamics differences, e.g., the simulation-to-reality gap.
