DITTO: Demonstration Imitation by Trajectory Transformation

Nick Heppert, Max Argus, Tim Welschehold, Thomas Brox, Abhinav Valada

TL;DR

DITTO addresses one-shot imitation from a single RGB-D human demonstration with an object-centric, two-stage pipeline: offline demonstration trajectory extraction and online trajectory generation. The method relies on object segmentation and relative pose estimation to warp the demonstrated trajectory into a new scene, followed by grasp prediction and motion planning for robot execution. Offline ablations and real-robot experiments demonstrate the approach’s practicality and expose failure modes related to detection, correspondence, and kinematic constraints. The work supports user-friendly, data-efficient robot learning by enabling passive human demonstrations and by providing open-source resources for future benchmarking and component-level improvements.

Abstract

Teaching robots new skills quickly and conveniently is crucial for the broader adoption of robotic systems. In this work, we address the problem of one-shot imitation from a single human demonstration, given by an RGB-D video recording. We propose a two-stage process. In the first stage, we extract the demonstration trajectory offline. This entails segmenting manipulated objects and determining their motion relative to secondary objects such as containers. In the online trajectory generation stage, we first re-detect all objects, then warp the demonstration trajectory to the current scene and execute it on the robot. To complete these steps, our method leverages several ancillary models, including those for segmentation, relative object pose estimation, and grasp prediction. We systematically evaluate different combinations of correspondence and re-detection methods to validate our design decisions across a diverse range of tasks. Specifically, we collect and quantitatively test on demonstrations of ten different tasks, including pick-and-place tasks as well as articulated object manipulation. Finally, we perform extensive evaluations on a real robot system to demonstrate the effectiveness and utility of our approach in real-world scenarios. We make the code publicly available at http://ditto.cs.uni-freiburg.de.
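The warping step described in the abstract can be illustrated with a minimal sketch. It assumes that object poses are available as 4x4 homogeneous transforms in a common frame and that the relative pose between the demonstration and the live scene is already known; the function names `make_pose` and `warp_trajectory` are hypothetical and are not taken from the released code.

```python
import numpy as np


def make_pose(yaw_rad, translation):
    """Build a 4x4 homogeneous transform from a yaw angle and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose


def warp_trajectory(demo_trajectory, demo_object_pose, live_object_pose):
    """Map demonstrated poses into the live scene (illustrative assumption)."""
    # Rigid transform that carries the demo object pose onto the live one.
    relative = live_object_pose @ np.linalg.inv(demo_object_pose)
    return np.array([relative @ pose for pose in demo_trajectory])


# Demonstration: the object starts at x = 0.5 m and is lifted by 0.1 m.
demo_start = make_pose(0.0, [0.5, 0.0, 0.0])
demo_lifted = make_pose(0.0, [0.5, 0.0, 0.1])
demo_trajectory = [demo_start, demo_lifted]

# Live scene: the same object is re-detected elsewhere, rotated by 90 degrees.
live_start = make_pose(np.pi / 2, [0.0, 0.5, 0.0])

warped = warp_trajectory(demo_trajectory, demo_start, live_start)
```

Here the first warped pose coincides with the re-detected object pose, and the lift is reproduced relative to the object's new location. A full system would additionally need grasp prediction and motion planning, which this sketch omits.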

Paper Structure (20 sections, 6 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Human demonstrations of manipulation actions are transferred to new scenes so that a robot can replicate them. For this, we use a two-stage process: we first extract object trajectories by segmenting and tracking objects, then transfer the trajectory to a new scene by re-detecting the objects and transforming the trajectory according to their re-detected positions. The proposed method is evaluated on several different tasks.
  • Figure 2: Our method first computes masks and trajectories from demonstration videos, then maps these onto new live observations by accounting for the change in object poses. These warped trajectories can either be evaluated separately or executed on a robot by using grasp planning and IK trajectory solvers.
  • Figure 3: Robot setup showing the Franka manipulator, with an end-of-arm depth camera, mounted onto a mobile base.
  • Figure 4: Examples of trajectory generation for various tasks. Top row: rendered examples of trajectories extracted from human demonstrations, in-painted into the initial demonstration observation. Middle row: rendered trajectories that have been generated in-situ for the robot imitation, in-painted into the live robot view. Bottom row: images from live robot imitation runs.
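
The mapping step in Figure 2 depends on the change in object pose between the demonstration and the live observation. One generic way to estimate such a change from matched 3D points is the Kabsch algorithm, sketched below. This is a stand-in for illustration only: the correspondence and pose estimation models actually used by the method are not specified in this summary, and the function name `estimate_rigid_transform` is hypothetical.

```python
import numpy as np


def estimate_rigid_transform(source_points, target_points):
    """Least-squares rigid transform (Kabsch) mapping source onto target.

    Both inputs are (N, 3) arrays of matched 3D points.
    """
    source_centroid = source_points.mean(axis=0)
    target_centroid = target_points.mean(axis=0)
    # Cross-covariance of the centered point sets.
    covariance = (source_points - source_centroid).T @ (
        target_points - target_centroid
    )
    u, _, vt = np.linalg.svd(covariance)
    # Flip the last axis if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = target_centroid - rotation @ source_centroid
    transform = np.eye(4)
    transform[:3, :3] = rotation
    transform[:3, 3] = translation
    return transform


# Synthetic check: rotate points by 30 degrees about z and shift them.
rng = np.random.default_rng(0)
demo_points = rng.normal(size=(20, 3))
angle = np.deg2rad(30.0)
true_rotation = np.array(
    [
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ]
)
true_translation = np.array([0.2, -0.1, 0.05])
live_points = demo_points @ true_rotation.T + true_translation

estimated = estimate_rigid_transform(demo_points, live_points)
```

With noise-free correspondences the estimate recovers the applied rotation and translation. Real correspondences contain outliers, so a practical pipeline would typically wrap an estimator like this in a robust scheme such as RANSAC.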