Manipulator-Independent Representations for Visual Imitation

Yuxiang Zhou, Yusuf Aytar, Konstantinos Bousmalis

TL;DR

The paper tackles cross-embodiment visual imitation for robotic manipulation using only visual observations, with no access to the demonstrator's actions. It introduces manipulator-independent representations (MIR), which combine a temporally smooth contrastive objective (TSCN) with cross-domain goal-conditioned policies (CD-GCP) and are trained on domain-randomized data to bridge visual domain gaps such as sim-to-real. An RL agent uses MIR to track demonstrated trajectories, including demonstrations from embodiments unseen during training, tasks with multiple-object interactions, and real-world recordings. The work describes a dataset and training regime, and gives empirical evidence that a perception module focused on changes in the environment, temporally smooth, and actionable for RL supports imitation across diverse morphologies.

Abstract

Imitation learning is an effective tool for robotic learning tasks where specifying a reinforcement learning (RL) reward is not feasible or where the exploration problem is particularly difficult. Imitation methods, typically behavior cloning or inverse RL, derive a policy from a collection of first-person action-state trajectories. This is contrary to how humans and other animals imitate: we observe a behavior, even from other species, understand its perceived effect on the state of the environment, and figure out what actions our body can perform to reach a similar outcome. In this work, we explore the possibility of third-person visual imitation of manipulation trajectories, only from vision and without access to actions, demonstrated by embodiments different from that of our imitating agent. Specifically, we investigate what would be an appropriate representation method with which an RL agent can visually track trajectories of complex manipulation behavior -- non-planar with multiple-object interactions -- demonstrated by experts with different embodiments. We present a way to train manipulator-independent representations (MIR) that primarily focus on the change in the environment and have all the characteristics that make them suitable for cross-embodiment visual imitation with RL: cross-domain alignment, temporal smoothness, and being actionable. We show that with our proposed method our agents are able to imitate, with complex robot control, trajectories from a variety of embodiments and with significant visual and dynamics differences, e.g. the simulation-to-reality gap.

Paper Structure

This paper contains 17 sections, 3 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Our work focuses on learning manipulator-independent representations (MIR) that can be used for imitating trajectories of behavior demonstrated with different embodiments, unseen during training, solely from pixel observations, even in the presence of large visual domain gaps.
  • Figure 2: Learning manipulator-independent representation (MIR) space. MIR is trained on a dataset generated using two pairs of environments: (a) domain-randomized and 'invisible arm' environments and (b) domain-randomized and arm-only-randomized environments. Please see the dataset section for the details of dataset collection.
  • Figure 3: Embedding space distances for same-domain and cross-domain goals. The first time-step is assumed to be the current observation and the goals are selected over the entire demonstration sequence with increasing temporal distance. Note that objects are manually aligned in simulation to match the first observation in the real-world sequence. The distance plots are normalized to $[0,1]$ for each trajectory and aggregated across $10$ trajectories. We also provide the Spearman's rank correlation between the reachability distance (i.e. linear increase of distance over time) and embedding distance for each of the trajectories. The mean rank correlations over all $10$ trajectories are displayed on each plot. Note that TCN performs reasonably well in the same domain but not across domains. MIR and its two components separately (i.e. TSCN and CD-GCP) are significantly better correlated with the reachability across domains. MIR has similar performance within and across domains.
  • Figure 4: Visualization of the different environments used in our work. Our imitating agent always operates in a canonical simulated environment (left). We have 3 additional versions of it which are used to generate data for perceptual training (middle). Finally, we use 4 held-out demonstration domains, both in simulation and in the real world, for cross-embodiment imitation.
  • Figure 5: t-SNE Projection of 3 triplets of trajectories, where each triplet consists of paired trajectories rendered in 3 visually different domains: with an 'invisible arm', with full domain randomization and with domain randomization only for the robotic arm. The features used were learned with cross-domain goal-conditioned policy (CD-GCP) combined with TSCN. The 3 triplets are clearly separated from each other and the trajectories from the different domains seem to align well, especially when there is an environment change that doesn't include the arm, a desired feature for our representations. See text for more details.
  • ...and 2 more figures
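
The evaluation described in the Figure 3 caption (per-trajectory distances normalized to $[0,1]$, then a Spearman rank correlation against a reachability distance that grows linearly with time, averaged over trajectories) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' code: the Euclidean metric, the function names, and the array layout (one row per time-step embedding) are all hypothetical choices, since the caption does not specify them.

```python
import numpy as np

def normalized_goal_distances(embeddings):
    """Distance from the first frame's embedding to every frame, scaled to [0, 1].

    Assumption: Euclidean distance; the caption does not name the metric.
    """
    emb = np.asarray(embeddings, dtype=float)
    d = np.linalg.norm(emb - emb[0], axis=1)
    span = d.max() - d.min()
    return (d - d.min()) / span if span > 0 else np.zeros_like(d)

def average_ranks(x):
    """Ranks starting at 1; tied values share the mean of their positions."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def mean_reachability_correlation(trajectories):
    """Average Spearman correlation between time index and embedding distance."""
    scores = []
    for emb in trajectories:
        d = normalized_goal_distances(emb)
        reachability = np.arange(len(d))  # linear increase of distance over time
        scores.append(spearman(reachability, d))
    return float(np.mean(scores))
```

A representation whose distance to the goal grows monotonically with temporal distance scores close to 1 under this measure. A trajectory whose embeddings are all identical gives an undefined correlation (NaN) and would need to be handled separately.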