From Watch to Imagine: Steering Long-horizon Manipulation via Human Demonstration and Future Envisionment

Ke Ye, Jiaming Zhou, Yuanfeng Qiu, Jiayi Liu, Shihui Zhou, Kun-Yu Lin, Junwei Liang

TL;DR

This work tackles zero-shot long-horizon robotic manipulation by introducing Super-Mimic, a hierarchical framework that translates unscripted human demonstrations into executable plans and uses imagined futures to ground low-level control. The HIT module extracts key actions from videos and produces a language-grounded subtask plan, while the FDP module generates plausible future dynamics to produce 3D trajectories for execution. Together, they enable robust, flexible manipulation in open-world settings and outperform text-only baselines by over 20% on challenging tasks. The approach reduces reliance on task-specific training data and demonstrates strong generalization and resilience to ambiguous instructions, with implications for scalable, general-purpose robotic systems.

Abstract

Generalizing to long-horizon manipulation tasks in a zero-shot setting remains a central challenge in robotics. Current approaches based on multimodal foundation models, despite their capabilities, typically fail to decompose high-level commands into executable action sequences from static visual input alone. To address this challenge, we introduce Super-Mimic, a hierarchical framework that enables zero-shot robotic imitation by directly inferring procedural intent from unscripted human demonstration videos. Our framework is composed of two sequential modules. First, a Human Intent Translator (HIT) parses the input video using multimodal reasoning to produce a sequence of language-grounded subtasks. These subtasks then condition a Future Dynamics Predictor (FDP), which employs a generative model to synthesize a physically plausible video rollout for each step. The resulting visual trajectories are dynamics-aware, explicitly modeling crucial object interactions and contact points to guide the low-level controller. We validate this approach through extensive experiments on a suite of long-horizon manipulation tasks, where Super-Mimic outperforms state-of-the-art zero-shot methods by over 20%. These results establish that coupling video-driven intent parsing with prospective dynamics modeling is a highly effective strategy for developing general-purpose robotic systems.

Paper Structure

This paper contains 19 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of Super-Mimic. The HIT module uses a VLM to translate human demonstrations into an adaptable symbolic plan, enabling task modifications and skill transfers beyond simple imitation. Then the FDP module employs a video generation model to imagine a plausible future execution for the current subtask. Finally, the Action Executor module grounds the imagined guidance into a final sequence of executable robot actions. Upon subtask completion, the system updates its observation and repeats the planning-execution loop.
  • Figure 2: The Human Intent Translator.
  • Figure 3: Our experimental platform, consisting of a 7-DoF xArm7 arm and a third-view Orbbec RGB-D camera.
  • Figure 4: Examples of major failures: (a) HIT planning error (e.g., misplacing the apple), (b) FDP prediction error (e.g., implausible object deformation), (c) Execution error (most common, e.g., grasp failure knocks over the cup).