One-Shot Imitation under Mismatched Execution

Kushal Kedia, Prithwish Dan, Angela Chao, Maximus Adrian Pace, Sanjiban Choudhury

TL;DR

RHyME is a framework that automatically pairs human and robot trajectories using sequence-level optimal transport cost functions. It synthesizes semantically equivalent human videos by retrieving and composing short-horizon human clips, which enables effective policy training without paired data.

Abstract

Human demonstrations as prompts are a powerful way to program robots to do long-horizon manipulation tasks. However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities. Existing methods for human-robot translation either depend on paired data, which is infeasible to scale, or rely heavily on frame-level visual similarities that often break down in practice. To address these challenges, we propose RHyME, a novel framework that automatically pairs human and robot trajectories using sequence-level optimal transport cost functions. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent human videos by retrieving and composing short-horizon human clips. This approach facilitates effective policy training without the need for paired data. RHyME successfully imitates a range of cross-embodiment demonstrators, both in simulation and with a real human hand, achieving an increase of over 50% in task success compared to previous methods. We release our code and datasets at https://portal-cornell.github.io/rhyme/.

Paper Structure (9 sections, 1 equation, 6 figures, 2 algorithms)

Figures (6)

  • Figure 1: Overview of RHyME. We introduce RHyME, a hierarchical framework that trains a robot policy to mimic a long-horizon video from a demonstrator that exhibits mismatched task execution. Train Goal (Left): Given unpaired human-robot datasets, RHyME employs sequence-level similarity functions to "imagine" a paired dataset for training one-shot imitation robot policies. Inference Goal (Right): Our robot policy translates a human video into robot actions to perform the specified long-horizon task.
  • Figure 2: Performance on Mismatched Execution Datasets. We present results on three datasets (left). As the demonstrator's execution deviates further from that of the robot, policies trained with our framework RHyME consistently outperform XSkill, as measured by task recall and imprecision rates.
  • Figure 3: Real-World Results. (Left) Task Embeddings: We use t-SNE to visualize cross-embodiment latent embeddings from the human and robot completing three tasks. (Right) Task Completion: We compare the performance of RHyME with XSkill on seen and unseen long-horizon tasks specified by human prompt videos. Opaque segments indicate Task Completion rate, and augmented transparent bars indicate Task Attempt rate.
  • Figure 4: Cross-Embodiment Vision Embeddings. (Left) Visualizing task embeddings. We use t-SNE to visualize cross-embodiment latent embeddings generated by the robot and the demonstrator when executing different tasks on all three datasets. (Right) TCC Failure Example: The robot and video clip are equivalent, but specific frames have high TCC losses. For example, a frame showing the robot performing the 'kettle' action has a high loss because its nearest neighbor in the video performs both 'kettle' and 'light' actions. This frame cycles back to the robot performing 'light', which is mismatched.
  • Figure 5: Optimal Transport Distances. We measure the similarity between robot and demonstrator videos on the Sphere-Hard dataset by computing the cost of the Optimal Transport (OT) plans. The sum over the entire transport cost matrix yields the distance between videos. OT costs are lowest when tasks are the same between videos (highlighted by a tick).
  • ...and 1 more figure
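
The sequence-level distance described in the Figure 5 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes each video is already encoded as a list of per-frame embedding vectors, uses cosine distance as the frame-to-frame cost, uniform weights over frames, and entropy-regularized OT solved with Sinkhorn iterations. The function names (`cosine_cost`, `sinkhorn_plan`, `ot_distance`) and the parameters `eps` and `n_iters` are hypothetical choices made for illustration.

```python
import math

def cosine_cost(x, y):
    """Cost matrix C[i][j] = 1 - cosine similarity of frame embeddings x[i], y[j]."""
    def norm(v):
        return math.sqrt(sum(a * a for a in v)) or 1.0  # guard against zero vectors
    return [[1.0 - sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
             for v in y] for u in x]

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan between uniform marginals (assumed weights)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over frames of the first video
    b = [1.0 / m] * m  # uniform mass over frames of the second video
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_distance(x, y, eps=0.1, n_iters=200):
    """Sum of (plan * cost) over all frame pairs: the transport cost of the plan."""
    cost = cosine_cost(x, y)
    plan = sinkhorn_plan(cost, eps, n_iters)
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(x)) for j in range(len(y)))

# Toy embeddings (made up): two "videos" of the same task, one of a different task.
robot = [[1.0, 0.0], [0.0, 1.0]]
human_same = [[0.9, 0.1], [0.1, 0.9]]
human_diff = [[-1.0, 0.0], [0.0, -1.0]]
d_same = ot_distance(robot, human_same)
d_diff = ot_distance(robot, human_diff)
```

In this toy setup `d_same` comes out smaller than `d_diff`, which mirrors the caption's observation that OT costs are lowest when the two videos show the same task. Because the cost is computed over whole sequences rather than matched frame by frame, two videos can be close even when no individual frames align tightly.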