One-Shot Dual-Arm Imitation Learning

Yilong Wang, Edward Johns

TL;DR

ODIL tackles one-shot learning for dual-arm manipulation by combining a dual-arm coordination paradigm with a three-stage visual servoing controller that first aligns to a bottleneck and then replays a single demonstrated trajectory. It leverages deep feature matching for robust visual alignment and fuses information from a global camera and a wrist camera via an Unscented Kalman Filter to achieve precise, robust localization across 4-DoF and 6-DoF tasks, even with distractors and occlusions. The method outperforms state-of-the-art one-shot imitation baselines on six real-world tasks and demonstrates resilience to scene changes, without requiring object models or additional data collection. These results indicate a practical and scalable route to data-efficient dual-arm manipulation in everyday tasks. Future work includes extending to multi-stage tasks, incorporating failure recovery, and generalizing to novel objects.

Abstract

We introduce One-Shot Dual-Arm Imitation Learning (ODIL), which enables dual-arm robots to learn precise and coordinated everyday tasks from just a single demonstration of the task. ODIL uses a new three-stage visual servoing (3-VS) method for precise alignment between the end-effector and target object, after which replay of the demonstration trajectory is sufficient to perform the task. This is achieved without requiring prior task or object knowledge, or additional data collection and training following the single demonstration. Furthermore, we propose a new dual-arm coordination paradigm for learning dual-arm tasks from a single demonstration. ODIL was tested on a real-world dual-arm robot, demonstrating state-of-the-art performance across six precise and coordinated tasks in both 4-DoF and 6-DoF settings, and showing robustness in the presence of distractor objects and partial occlusions. Videos are available at: https://www.robot-learning.uk/one-shot-dual-arm.
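The abstract states that, once the end-effector is aligned with the object, replaying the demonstration trajectory is enough to complete the task. One way to picture this step is the sketch below. It assumes, for illustration only, that waypoints and bottleneck poses are stored as 4x4 homogeneous transforms in a common world frame; the names `make_pose` and `adapt_trajectory` are hypothetical and do not come from the paper. The idea is to express each demonstrated waypoint relative to the demonstration bottleneck, then re-anchor it at the bottleneck estimated at test time.

```python
import numpy as np

def make_pose(yaw, translation):
    """Build a 4x4 homogeneous transform: rotation about z by `yaw`, then a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def adapt_trajectory(demo_waypoints, demo_bottleneck, test_bottleneck):
    """Re-anchor demonstrated end-effector waypoints at a new bottleneck pose.

    Assumption: all poses are world-frame 4x4 transforms. Each waypoint is
    first expressed in the demonstration bottleneck frame, then mapped back
    to the world frame through the test-time bottleneck pose.
    """
    world_to_demo_bottleneck = np.linalg.inv(demo_bottleneck)
    return [test_bottleneck @ world_to_demo_bottleneck @ w for w in demo_waypoints]

# Hypothetical demonstration: start at the bottleneck, then descend 10 cm.
demo_bottleneck = make_pose(0.0, [0.5, 0.0, 0.2])
demo_waypoints = [demo_bottleneck, make_pose(0.0, [0.5, 0.0, 0.1])]

# Hypothetical test scene: the object has been moved and rotated by 90 degrees.
test_bottleneck = make_pose(np.pi / 2, [0.3, 0.4, 0.2])
adapted = adapt_trajectory(demo_waypoints, demo_bottleneck, test_bottleneck)
```

In this sketch the first adapted waypoint coincides with the new bottleneck, and the 10 cm descent is preserved relative to it. The paper's coordinated dual-arm parameterization is richer than this single-arm illustration.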

Paper Structure

This paper contains 12 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: ODIL learns precise, coordinated tasks from a single demonstration (first column) and adapts to novel robot, object, and scene configurations. Starting from an arbitrary position (second column), the robot first aligns with the object (third column), and then replays the demonstration (last column).
  • Figure 2: Overall Framework. During demonstration, the end-effector $E$ is first moved to a bottleneck ${B}$, with the task object $O$ visible to both the global and wrist cameras, ${C}_{g}$ and ${C}_{w}$, respectively. Then, from each camera, a bottleneck RGB-D image is captured and segmented, and a demonstration trajectory is recorded and parameterized into a coordinated trajectory $\boldsymbol{\tau}$. During testing, starting from an arbitrary initial robot pose where $O$ is visible to ${C}_{g}$, we use our 3-VS controller to align $E$ with the new ${B}$, then execute $\boldsymbol{\tau}$ to complete the task.
  • Figure 3: Dual-arm Coordination Paradigm. The tasks visualized include lifting a pot, splitting tape, uncapping a bottle, and stirring a bowl. Blue denotes one arm, orange the other.
  • Figure 4: Three-stage Visual Servoing. (1) The process begins with 3D Visual Servoing, using the initial open-loop global-camera bottleneck estimate for control until the wrist camera's estimates become stable. (2) Once stable, the initial global-camera estimate serves as a prior and is fused with redundant sequential wrist-camera estimates using a Kalman filter, until the overlap between the bottleneck and current wrist-camera images exceeds a threshold. (3) The control strategy then shifts to 2.5D Visual Servoing until convergence. Finally, the coordinated trajectory is adapted and executed.
  • Figure 5: Comparison of the best 4 methods from Table \ref{tab:correspondence_comparison} near convergence with strong illumination variations.
  • ...and 1 more figure
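
The stage switching and estimate fusion described in the Figure 4 caption can be sketched roughly as follows. This is a simplified illustration, not the authors' controller: it uses a plain linear Kalman measurement update on a 3D position with an identity measurement model, whereas the TL;DR mentions an Unscented Kalman Filter over the full pose. The function names, the stability flag, and the overlap threshold are all hypothetical.

```python
import numpy as np

def select_stage(wrist_stable, overlap, overlap_threshold):
    """Pick the active servoing stage from the two switching conditions.

    Stage 1: global-camera estimate only, until wrist estimates are stable.
    Stage 2: fused estimates, until image overlap exceeds the threshold.
    Stage 3: final visual servoing until convergence.
    """
    if not wrist_stable:
        return 1
    if overlap <= overlap_threshold:
        return 2
    return 3

def fuse_estimates(prior_mean, prior_cov, measurement, measurement_cov):
    """Linear Kalman update with an identity measurement model (assumed).

    The prior stands in for the global-camera bottleneck estimate, and the
    measurement for one wrist-camera estimate of the same quantity.
    """
    gain = prior_cov @ np.linalg.inv(prior_cov + measurement_cov)
    mean = prior_mean + gain @ (measurement - prior_mean)
    cov = (np.eye(len(prior_mean)) - gain) @ prior_cov
    return mean, cov

# Hypothetical numbers: a coarse global prior refined by a sharper wrist estimate.
prior_mean = np.zeros(3)
prior_cov = 0.04 * np.eye(3)
wrist_measurement = np.array([0.1, 0.0, 0.0])
wrist_cov = 0.01 * np.eye(3)
fused_mean, fused_cov = fuse_estimates(prior_mean, prior_cov, wrist_measurement, wrist_cov)
```

Because the wrist estimate is assumed to be less noisy than the global prior, the fused mean lands closer to the wrist measurement, and the fused covariance is smaller than either input. Calling `fuse_estimates` once per incoming wrist-camera estimate, with the previous output as the new prior, gives the sequential fusion that the caption describes for stage 2.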