ManiLong-Shot: Interaction-Aware One-Shot Imitation Learning for Long-Horizon Manipulation

Zixuan Chen, Chongkai Gao, Lin Shao, Jieqi Shi, Jing Huo, Yang Gao

TL;DR

ManiLong-Shot tackles one-shot imitation learning (OSIL) for long-horizon (LH) manipulation by modeling tasks as sequences of interaction-driven primitives. It introduces an interaction-aware pipeline with (i) task decomposition into pre-contact, grasping, and post-contact phases, guided by either a vision-language model (VLM) or rule-based cues, (ii) invariant-region prediction per phase, and (iii) region matching to transfer demonstrated primitives to new scenes. Trained on 10 short-horizon tasks, it generalizes to 20 unseen long-horizon tasks in simulation, achieving a 22.8% relative improvement over the state of the art in LH OSIL, and is validated on three long-horizon tasks on a real robot. The work offers a practical OSIL framework for complex, multi-step manipulation and a new RLBench-Oneshot benchmark for evaluation.

Abstract

One-shot imitation learning (OSIL) offers a promising way to teach robots new skills without large-scale data collection. However, current OSIL methods are largely confined to short-horizon tasks, which limits their applicability to complex, long-horizon manipulation. To address this limitation, we propose ManiLong-Shot, a novel framework that enables effective OSIL for long-horizon prehensile manipulation tasks. ManiLong-Shot structures long-horizon tasks around physical interaction events, reframing the problem as sequencing interaction-aware primitives instead of directly imitating continuous trajectories. This primitive decomposition can be driven by high-level reasoning from a vision-language model (VLM) or by rule-based heuristics derived from robot state changes. For each primitive, ManiLong-Shot predicts invariant regions critical to the interaction, establishes correspondences between the demonstration and the current observation, and computes the target end-effector pose, enabling effective task execution. Extensive simulation experiments show that ManiLong-Shot, trained on only 10 short-horizon tasks, generalizes to 20 unseen long-horizon tasks across three difficulty levels via one-shot imitation, achieving a 22.8% relative improvement over the state of the art (SOTA). Additionally, real-robot experiments validate ManiLong-Shot's ability to robustly execute three long-horizon manipulation tasks via OSIL, confirming its practical applicability.
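
The step from matched regions to a target end-effector pose can be illustrated with a small sketch. The abstract does not say how the pose is computed, so the code below is an assumption rather than the authors' method: it supposes that correspondences are available as matched 3D points, fits a rigid transform with the standard Kabsch (SVD) algorithm, and applies that transform to the demonstrated end-effector pose. The names `estimate_rigid_transform` and `transfer_pose`, and all numeric values, are hypothetical.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= R @ src + t.

    Standard Kabsch algorithm; an illustrative stand-in, not the paper's solver.
    src, dst: (N, 3) arrays of matched 3D points, N >= 3 and not collinear.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def transfer_pose(T_demo, R, t):
    """Map a 4x4 demonstrated end-effector pose into the current scene."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T @ T_demo


# Hypothetical matched points on an invariant region (demo frame, metres).
demo_pts = np.array([[0.0, 0.0, 0.0],
                     [0.1, 0.0, 0.0],
                     [0.0, 0.1, 0.0],
                     [0.0, 0.0, 0.1]])

# Simulate the same region in a new scene: rotated 90 degrees about z and shifted.
R_true = np.array([[0.0, -1.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.3, -0.2, 0.05])
live_pts = demo_pts @ R_true.T + t_true

R_est, t_est = estimate_rigid_transform(demo_pts, live_pts)

# Hypothetical demonstrated end-effector pose (identity orientation).
T_demo = np.eye(4)
T_demo[:3, 3] = [0.05, 0.05, 0.1]
T_target = transfer_pose(T_demo, R_est, t_est)
```

With noise-free correspondences the recovered transform matches the simulated one exactly. With noisy or partially wrong matches, a robust wrapper such as RANSAC around the same solver would be a natural extension, though that too goes beyond what the abstract describes.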

Paper Structure

This paper contains 31 sections, 3 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: We introduce ManiLong-Shot, a novel framework for effective OSIL in long-horizon prehensile manipulation.
  • Figure 2: The Overall Training Pipeline of ManiLong-Shot. Best viewed when zoomed in.
  • Figure 3: Visualization of the three physical interaction phases.
  • Figure 4: Visualization of 20 long-horizon manipulation tasks in RLBench-Oneshot, including 3 difficulty levels.
  • Figure 5: Three real-world LH tasks with their physical interaction visualizations.
  • ...and 5 more figures