NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos

Hongyu Li, Lingfeng Sun, Yafei Hu, Duy Ta, Jennifer Barry, George Konidaris, Jiahui Fu

TL;DR

NovaFlow tackles the data bottleneck in robotic manipulation by decoupling task understanding from control and leveraging pretrained video generation models to infer general motion knowledge. It converts a task description and a single observation into an actionable 3D object flow through a flow generator, then executes actions via a flow executor that handles rigid and deformable objects without robot-specific training. The key contributions are the actionable 3D object flow representation, the two-stage pipeline of flow generation and execution, and the demonstrated zero-shot manipulation across embodiments including table-top and mobile platforms. The approach shows state-of-the-art zero-shot performance on real-world tasks and highlights the potential of using large-scale video priors for robotics, while also pointing to the need for closed-loop feedback to address execution-time dynamics.

Abstract

Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, and achieve effective zero-shot execution without demonstrations or embodiment-specific training. Project website: https://novaflow.lhy.xyz/.
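
The abstract says relative poses for rigid objects are computed from the object flow but does not say how. The following is a minimal sketch of one common way to do it, a least-squares rigid fit (the Kabsch algorithm) between the tracked points at the first frame and at each later frame. The function names and the `(T, N, 3)` flow layout are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def rigid_transform_from_points(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t.

    Standard Kabsch/SVD fit; src and dst are (N, 3) arrays of matched points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def relative_poses_from_flow(flow):
    """Relative pose of every frame w.r.t. frame 0.

    flow: (T, N, 3) array, the 3D positions of N tracked object points
    over T frames (hypothetical layout for the object flow).
    """
    return [rigid_transform_from_points(flow[0], flow[k]) for k in range(len(flow))]
```

Under these assumptions, each returned `(R, t)` pair could serve as a target object pose for a downstream planner. A real pipeline would likely also need outlier rejection for noisy tracks, which this sketch omits.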

Paper Structure

This paper contains 32 sections, 10 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Flow generator pipeline. Given an initial image and a task prompt, a video model is used to generate a video of the plausible object motion. This video is then processed by pretrained perception modules to distill an actionable 3D object flow. This involves (1) lifting the 2D video to 3D using monocular depth estimation, (2) calibrating the estimated depth against the initial depth, (3) tracking the dense per-point motion using 3D point tracking, and (4) extracting the object-centric 3D flow via object grounding.
  • Figure 2: Flow executor pipeline. The initial end-effector pose is determined from grasp proposal candidates. Robot trajectories are then planned based on the actionable flow considering costs and constraints, and subsequently tracked by the robots.
  • Figure 3: Rejection sampling for flow generator. We generate multiple video candidates in parallel and create the object flow image for each by back-projecting its object flow, $\mathcal{F}$, onto the initial frame. A VLM (in our case, Google Gemini) evaluates all the flow images to select the most plausible video candidate.
  • Figure 4: Experiment results. We compare against Diffusion Policy (DP) (Chi et al., 2023) trained using 10 and 30 demonstrations, the inverse dynamics model (IDM) from UniPi (Du et al., 2023), AVDC (Ko et al., 2023), and VidBot (Chen et al., 2025) in real-world tabletop manipulation tasks.
  • Figure 5: Real-world manipulation experiments. NovaFlow supports cross-embodiment manipulation: we use it to manipulate rigid, deformable, and articulated objects with both a tabletop manipulator and a mobile manipulator.
  • ...and 10 more figures
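
Steps (1) and (2) of the Figure 1 caption, lifting to 3D with estimated depth and calibrating that depth against the initial depth, can be sketched as follows. This is an illustrative sketch only: it assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`) and assumes the calibration is a global scale-and-shift fit on the first frame. The caption does not specify the calibration method, and all function names here are hypothetical.

```python
import numpy as np


def fit_scale_shift(est_depth0, ref_depth0):
    """Least-squares (scale, shift) with scale * est_depth0 + shift ~= ref_depth0.

    est_depth0: (H, W) monocular depth estimate for the first frame.
    ref_depth0: (H, W) measured initial depth; non-positive or non-finite
    pixels are treated as invalid and ignored.
    """
    valid = np.isfinite(ref_depth0) & (ref_depth0 > 0)
    A = np.stack([est_depth0[valid], np.ones(int(valid.sum()))], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, ref_depth0[valid], rcond=None)
    return scale, shift


def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H, W, 3) point map in the camera frame."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # v indexes rows, u indexes columns
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def lift_video(est_depths, ref_depth0, fx, fy, cx, cy):
    """Calibrate every estimated depth frame with the first-frame fit, then lift."""
    scale, shift = fit_scale_shift(est_depths[0], ref_depth0)
    return [backproject(scale * d + shift, fx, fy, cx, cy) for d in est_depths]
```

Fitting the scale and shift once on the first frame and reusing them for later frames is a design choice made for this sketch; it presumes the depth estimator's scale ambiguity is consistent across the generated video.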