NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li, Jianing Qian, Jennifer L. Barry, Kris Kitani, George Konidaris

TL;DR

NovaPlan is a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. It performs complex assembly tasks and exhibits dexterous error recovery behaviors without any prior demonstrations or training.

Abstract

Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: https://nova-plan.github.io/
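
The closed-loop structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `run_closed_loop`, `choose_reference`, `plan_next`, `score_priors`, and `execute` are hypothetical, the reliability scores are placeholders for whatever criterion the switching mechanism actually uses, and the VLM planner, video generator, and robot are replaced by plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    subgoal: str
    reference: str  # "object" or "hand": which prior drove the robot action
    success: bool   # verdict of the (hypothetical) critic after execution


def choose_reference(object_score: float, hand_score: float) -> str:
    """Switch to whichever kinematic prior has the higher reliability score.

    The scores are an assumption of this sketch; ties favor the object prior.
    """
    return "object" if object_score >= hand_score else "hand"


def run_closed_loop(
    plan_next: Callable[[List[StepResult]], Optional[str]],
    score_priors: Callable[[str], Tuple[float, float]],
    execute: Callable[[str, str], bool],
    max_steps: int = 10,
) -> List[StepResult]:
    """Plan, execute, observe, and re-plan until the planner reports completion."""
    history: List[StepResult] = []
    for _ in range(max_steps):
        # The planner sees the full history, so it can re-issue a failed sub-goal.
        subgoal = plan_next(history)
        if subgoal is None:  # planner judges the task complete
            break
        object_score, hand_score = score_priors(subgoal)
        reference = choose_reference(object_score, hand_score)
        success = execute(subgoal, reference)
        history.append(StepResult(subgoal, reference, success))
    return history
```

Because the planner receives the execution history on every iteration, a failed step simply shows up as an unfinished sub-goal on the next pass, which is the recovery behavior the abstract attributes to closed-loop re-planning.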

Paper Structure

This paper contains 39 sections, 13 equations, 14 figures, and 2 tables.

Figures (14)

  • Figure 1: Zero-shot long-horizon manipulation. Given a task and an initial observation, a video language planner selects and generates the best next-step video. Executable robot actions are extracted from the video using hand or object tracking. The updated observation is sent back to the planner to enable closed-loop reasoning and recovery from failure states.
  • Figure 2: NovaPlan system overview. A high-level planner takes in the task instruction and current observation, and proposes multiple task-solving videos after VLM reasoning about the task progress. Videos are selected based on flow and semantic consistency. The low-level planner calculates the robot action from the extracted hand or object flow. Updated observations are sent to a VLM critic to determine whether the robot should proceed to the next step or recover from failure.
  • Figure 3: NovaPlan low-level execution system overview. Given the generated video plan (RGB+Depth), we switch between (top) grounding the target object, tracking 3D keypoints, and recovering object flow, or (bottom) estimating the hand pose with HaMeR, calibrating 3D scale, and computing the hand flow. The resulting object/hand flows are converted into robot actions for on-robot execution, with a video language planner enabling closed-loop re-planning and recovery.
  • Figure 4: Geometric grounding for non-prehensile recovery. (a) HaMeR (Pavlakos et al., 2024) predicts the MANO hand mesh on the generated recovery video. (b) Red: mesh from raw prediction. Green: mesh after scale calibration. Cyan: mesh after additional translation offset enforcing fingertip and object surface contact. (c) The hand trajectory after calibration. (d) Calibrated trajectory converted into per-frame SE(3) transforms.
  • Figure 5: Robot Experiments. (a-c) The long-horizon tasks in \ref{sec:exp1}, each with three steps. (d-e) The Functional Manipulation Benchmark Multi-Object Multi-Stage Assembly 1 task and its variant in \ref{sec:exp3}, including one recovery step.
  • ...and 9 more figures
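
The scale calibration step mentioned in the captions for Figures 3 and 4 can be illustrated with a small sketch. This is one plausible approach and not necessarily the paper's: it assumes the hand mesh is predicted up to an unknown metric scale in the camera frame, and that measured depth is available at a few mesh vertices. The function names `estimate_scale` and `rescale_points` are hypothetical.

```python
from statistics import median
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def estimate_scale(
    predicted_depths: Sequence[float], measured_depths: Sequence[float]
) -> float:
    """Estimate a single metric scale as the median measured/predicted depth ratio.

    Pairs with a non-positive depth are skipped, since depth sensors commonly
    report 0 for missing readings. The median limits the effect of outliers.
    """
    ratios = [
        m / p
        for p, m in zip(predicted_depths, measured_depths)
        if p > 0 and m > 0
    ]
    if not ratios:
        raise ValueError("no valid depth pairs to calibrate against")
    return median(ratios)


def rescale_points(points: Sequence[Point], scale: float) -> List[Point]:
    """Scale camera-frame points about the camera origin."""
    return [(scale * x, scale * y, scale * z) for x, y, z in points]
```

Scaling about the camera origin leaves each point's pinhole projection (x/z, y/z) unchanged, so the rescaled mesh still lines up with the hand in the image while its depth moves toward the measured values. An additional translation offset, such as the contact-enforcing one shown in Figure 4(b), would be applied as a separate step.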