3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model

Hongyan Zhi, Peihao Chen, Siyuan Zhou, Yubo Dong, Quanxi Wu, Lei Han, Mingkui Tan

TL;DR

This work tackles the generalization gap in robotic manipulation by learning an embodiment-agnostic, object-centric 3D flow representation. It introduces ManiFlow-110k and a diffusion-based 3D flow world model that predicts object trajectories conditioned on tasks, enabling closed-loop planning through a flow-guided rendering loop and GPT-4o verification. The predicted 3D flow serves as a constraint for an optimization-based action policy, enabling cross-embodiment transfer without hardware-specific training. Extensive experiments demonstrate strong generalization across tasks, robust cross-embodiment adaptation, and better 3D motion capture than 2D-flow or non-object-centric approaches.

Abstract

Manipulation has long been a challenging task for robots, while humans can effortlessly perform complex interactions with objects, such as hanging a cup on a mug rack. A key reason is the lack of a large and uniform dataset for teaching robots manipulation skills. Current robot datasets often record robot actions in different action spaces within simple scenes. This prevents robots from learning a unified and robust action representation that holds across different robots and diverse scenes. Observing how humans understand a manipulation task, we find that understanding how the objects should move in 3D space is a critical clue for guiding actions. This clue is embodiment-agnostic and suitable for both humans and different robots. Motivated by this, we aim to learn a 3D flow world model from both human and robot manipulation data. This model predicts the future movement of the interacting objects in 3D space, guiding action planning for manipulation. Specifically, we synthesize a large-scale 3D optical flow dataset, named ManiFlow-110k, through an automatic moving-object detection pipeline. A video diffusion-based world model then learns manipulation physics from these data, generating 3D optical flow trajectories conditioned on language instructions. With the generated 3D object optical flow, we propose a flow-guided rendering mechanism, which renders the predicted final state and leverages GPT-4o to assess whether the predicted flow aligns with the task description. This equips the robot with a closed-loop planning ability. Finally, we treat the predicted 3D optical flow as constraints for an optimization policy that determines a chunk of robot actions for manipulation. Extensive experiments demonstrate strong generalization across diverse robotic manipulation tasks and reliable cross-embodiment adaptation without hardware-specific training.
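
The final step of the abstract, using predicted 3D flow as constraints for an optimization policy, can be illustrated with a minimal sketch. This summary does not give the paper's actual objective or solver, so the sketch substitutes a closed-form least-squares rigid fit (the Kabsch algorithm) and assumes the object is rigid and stays fixed in the gripper after grasping. The function names, the `(T, N, 3)` flow layout, and the 4x4 pose convention are hypothetical choices made for illustration.

```python
import numpy as np


def fit_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t.

    Kabsch algorithm; used here as a stand-in for the paper's optimizer.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (determinant -1) in the SVD solution.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def flow_to_ee_waypoints(flow, ee_pose0):
    """Turn an object-centric 3D flow into end-effector waypoints.

    flow:     (T, N, 3) array, positions of N object points over T steps
              (assumed layout, not taken from the paper).
    ee_pose0: 4x4 homogeneous end-effector pose at grasp time.
    Returns a (T, 4, 4) array of end-effector poses, assuming the object
    moves rigidly with the gripper.
    """
    flow = np.asarray(flow, dtype=float)
    waypoints = []
    for k in range(flow.shape[0]):
        R, t = fit_rigid_transform(flow[0], flow[k])
        obj_motion = np.eye(4)
        obj_motion[:3, :3] = R
        obj_motion[:3, 3] = t
        # The gripper must undergo the same world-frame motion as the object.
        waypoints.append(obj_motion @ ee_pose0)
    return np.stack(waypoints)
```

Because every waypoint is expressed as an end-effector pose rather than as joint commands, the same flow could in principle be handed to any arm's inverse-kinematics solver, which is the intuition behind the embodiment-agnostic claim.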

Paper Structure

This paper contains 24 sections, 2 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: 3DFlowAction seeks to build a flow world model to generate 3D optical flow that serves as action guidance for downstream manipulation tasks. Experiments on four complex foundational tasks in different settings demonstrate strong generalization across various manipulation tasks and reliable cross-embodiment adaptation without hardware-specific training.
  • Figure 2: Overview of 3D Flow Generation pipeline. (I) We synthesized the 3D flow dataset ManiFlow-110k using a moving object detection pipeline. (II) We pre-trained a video diffusion model as the flow world model on ManiFlow-110k to learn the physical motion patterns of objects in manipulation tasks. (III) ManiFlow-110k comes from a wide range of robot and human videos.
  • Figure 3: Overview of flow-guided action generation pipeline. (I) 3DFlowAction first performs closed-loop 3D flow generation through a self-correcting process. (II) A task-aware grasp pose generation process selects a task-relevant grasp pose while avoiding unreachable target positions. (III) An optimization procedure conditioned on 3D flow solves for a chunk of actions.
  • Figure 4: Demonstration of placement of four foundational tasks.
  • Figure 5: Visualization of planning and execution from different world models for pouring tea from the teapot to the cup. All baseline methods for planning are correct; however, their code-based or 2D planning struggles to fully capture the motion of objects in 3D space, resulting in failures in action planning.
  • ...and 4 more figures