Learning Multi-Step Manipulation Tasks from A Single Human Demonstration

Dingkun Guo

TL;DR

This work tackles data-efficient learning of multi-step robot manipulation from a single human demonstration. It presents a three-module system (vision, learning, and manipulation) that converts RGBD demonstrations into executable robot primitives by identifying task-relevant key poses with Grounded Segment Anything and by tracking hand-object and object-object contacts. The learning module segments actions into three primitives (Make, Maintain, and Break) and generates policies that map object poses and contacts to robot motions, while the manipulation module adapts these policies to different robots and environments via pose proposals and collision-aware planning. With a single demonstration recorded in a mockup kitchen, the system achieved 50-100% success on each step and up to a 40% success rate on the whole task with different objects in a home kitchen. Generalizing to unseen objects and environments remains challenging, which highlights the importance of robust pose estimation and motion planning. The approach offers a data-efficient, modular pathway toward generalizable robot manipulation from a single demonstration, with avenues for extending to more complex objects and tasks.

Abstract

Learning from human demonstrations has achieved remarkable results in robot manipulation. However, it remains a challenge to develop a robot system that matches human capabilities and data efficiency in learning and generalizability, particularly in complex, unstructured real-world scenarios. We propose a system that processes RGBD videos to translate human actions to robot primitives and identifies task-relevant key poses of objects using Grounded Segment Anything. We then address the challenges robots face in replicating human actions, considering the human-robot differences in kinematics and collision geometry. To test the effectiveness of our system, we conducted experiments focusing on manual dishwashing. With a single human demonstration recorded in a mockup kitchen, the system achieved 50-100% success for each step and up to a 40% success rate for the whole task with different objects in a home kitchen. Videos are available at https://robot-dishwashing.github.io

Paper Structure

This paper contains 15 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Human-Robot Comparative Overview of the Manual Dishwashing Task. Our system, with a single human video demonstration, learns multi-step manipulation tasks, including manipulation of articulated objects (switching on the faucet in the left column), non-prehensile manipulation (rinsing a bowl in the middle column), and pick-and-place (right column).
  • Figure 2: System Overview. The vision module processes RGBD images captured by the (a) Multi-Camera System with (b) Point Cloud Registration. With (c) Object Segmentation, we convert point clouds to models ((d) Model Construction) for (e) Object Pose Estimation. In the learning module, we calculate distances between object point clouds for (f) Hand-Object and Object-Object Contact Detection, and we use changes in contact relationships to guide (g) Primitive Segmentation and Classification and (h) Policy Generation. The manipulation module executes the robot policy ((i) Policy Execution) to generate a timed object trajectory. With adjusted desired object poses from (j) Alternative Object Pose Proposal, (k) Inverse Kinematics and Motion Planning with Collision Avoidance finds a robot path and sends joint angle commands to the (l) Robot Controller.
  • Figure 3: Robot, Objects, and Simulation Environment. The (a) original Franka Hand fingers are so short that the robot blocks the water. We solved this by designing (b) longer fingers. We waterproofed the robot with poly tubing on its arm and a glove on the gripper. We used the objects in (c) for testing. We used (d) a simulated environment in PyBullet for collision checking and motion planning.
  • Figure 4: Overview of the Task of Washing A Bowl. The system segments human actions into robot primitives (top). We recorded demonstrations in the lab kitchen (middle row) and tested the system in the home kitchen (bottom row).
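
The Figure 2 caption describes contact detection from distances between object point clouds, and primitive segmentation from changes in contact relationships. The sketch below illustrates that idea only; it is not the authors' implementation. The function names, the 1 cm distance threshold, and the frame-by-frame reading of Make (contact begins), Maintain (contact persists), and Break (contact ends) are all assumptions made for illustration.

```python
import numpy as np

def min_cloud_distance(cloud_a, cloud_b):
    """Smallest Euclidean distance between any point of cloud_a (N, 3)
    and any point of cloud_b (M, 3). Brute force, fine for small clouds."""
    diffs = cloud_a[:, None, :] - cloud_b[None, :, :]   # shape (N, M, 3)
    return float(np.sqrt((diffs ** 2).sum(axis=-1)).min())

def contact_sequence(frames_a, frames_b, threshold=0.01):
    """Per-frame contact flags. The 0.01 m threshold is a hypothetical value,
    not one taken from the paper."""
    return [min_cloud_distance(a, b) < threshold
            for a, b in zip(frames_a, frames_b)]

def label_primitives(contacts):
    """Label each frame by how contact changed since the previous frame
    (assumed interpretation of Make / Maintain / Break)."""
    labels = []
    prev = False
    for cur in contacts:
        if cur and not prev:
            labels.append("make")       # contact begins
        elif cur and prev:
            labels.append("maintain")   # contact persists
        elif prev and not cur:
            labels.append("break")      # contact ends
        else:
            labels.append(None)         # no contact before or now
        prev = cur
    return labels

# Toy example: a two-point "hand" approaches a fixed two-point "bowl",
# touches it for two frames, then retreats. Gaps are in metres.
bowl = np.array([[0.0, 0.0, 0.0], [0.0, 0.01, 0.0]])
gaps = [0.05, 0.02, 0.005, 0.005, 0.05]
hand_frames = [np.array([[g, 0.0, 0.0], [g, 0.0, 0.01]]) for g in gaps]
bowl_frames = [bowl] * len(gaps)

contacts = contact_sequence(hand_frames, bowl_frames)
labels = label_primitives(contacts)
print(labels)
```

On real sensor data, a nearest-neighbor structure such as a k-d tree would likely replace the brute-force distance computation, and the contact flags would probably need temporal smoothing so that noise does not produce spurious make and break events.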