General Flow as Foundation Affordance for Scalable Robot Learning

Chengbo Yuan, Chuan Wen, Tong Zhang, Yang Gao

TL;DR

This work introduces General Flow as a foundation affordance for scalable robot learning. It learns a language-conditioned 3D flow predictor directly from large-scale RGBD human videos and employs a multimodal, scale-aware model (ScaleFlow) to predict future trajectories of points on objects. The approach enables zero-shot real-world robot manipulation with a simple closed-loop policy, achieving 81% success across 18 tasks in 6 scenes. By coupling cross-embodiment data with a robust training regimen and augmentations, it demonstrates scalable, generalizable post-grasp guidance across rigid, articulated, and soft objects.

Abstract

We address the challenge of acquiring real-world manipulation skills with a scalable framework. We believe that identifying an appropriate prediction target capable of leveraging large-scale datasets is crucial for achieving efficient and universal learning. Therefore, we propose to utilize 3D flow, which represents the future trajectories of 3D points on objects of interest, as an ideal prediction target. To exploit scalable data resources, we turn our attention to human videos. We develop, for the first time, a language-conditioned 3D flow prediction model directly from large-scale RGBD human video datasets. Our predicted flow offers actionable guidance, thus facilitating zero-shot skill transfer in real-world scenarios. We deploy our method with a policy based on closed-loop flow prediction. Remarkably, without any in-domain finetuning, our method achieves an 81% success rate in zero-shot human-to-robot skill transfer, covering 18 tasks in 6 scenes. Our framework features the following benefits: (1) scalability: leveraging cross-embodiment data resources; (2) wide application: multiple object categories, including rigid, articulated, and soft bodies; (3) stable skill transfer: providing actionable guidance with a small inference domain gap. Code, data, and supplementary materials are available at https://general-flow.github.io
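
To make the prediction target concrete, the following is a minimal sketch of how a 3D flow (one future trajectory per query point) could be represented and consumed by a closed-loop policy. It is an illustration under stated assumptions, not the authors' implementation: `toy_predict_flow` is a hypothetical stand-in for the learned language-conditioned predictor, and the controller is simplified to translation only, executing just the first predicted step before re-predicting.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]
Trajectory = List[Point]   # positions of one query point at steps 0..T
Flow = List[Trajectory]    # one trajectory per query point


def toy_predict_flow(points: List[Point], horizon: int = 3) -> Flow:
    """Hypothetical stand-in for a learned flow predictor.

    Every query point moves 1 cm per step along +x. A real model would
    condition on the scene point cloud and a language instruction.
    """
    return [[(x + 0.01 * t, y, z) for t in range(horizon + 1)]
            for (x, y, z) in points]


def mean_step(flow: Flow, t: int = 1) -> Point:
    """Average displacement of all query points between step 0 and step t."""
    n = len(flow)
    sums = [0.0, 0.0, 0.0]
    for traj in flow:
        for k in range(3):
            sums[k] += traj[t][k] - traj[0][k]
    return (sums[0] / n, sums[1] / n, sums[2] / n)


def closed_loop_rollout(
    points: List[Point],
    predict: Callable[[List[Point]], Flow],
    n_iters: int,
) -> Tuple[Point, List[Point]]:
    """Re-predict the flow each iteration and execute only its first step.

    Assumption: translation-only control. The gripper and the grasped
    points are shifted by the mean first-step displacement.
    """
    gripper: Point = (0.0, 0.0, 0.0)
    for _ in range(n_iters):
        flow = predict(points)
        dx, dy, dz = mean_step(flow, t=1)
        gripper = (gripper[0] + dx, gripper[1] + dy, gripper[2] + dz)
        points = [(x + dx, y + dy, z + dz) for (x, y, z) in points]
    return gripper, points


if __name__ == "__main__":
    start = [(0.0, 0.0, 0.0), (0.0, 0.1, 0.0)]
    gripper, moved = closed_loop_rollout(start, toy_predict_flow, n_iters=5)
    print(gripper, moved)
```

Re-predicting at every iteration is what makes the loop closed: errors from one step are observed in the updated point positions and corrected by the next prediction, rather than accumulating along a single open-loop trajectory.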

Paper Structure (38 sections, 4 equations, 10 figures, 6 tables, 1 algorithm)

Figures (10)

  • Figure 1: We propose General Flow as a foundational affordance. Our framework uses general flow affordance as a bridge representation for human-to-robot skill transfer. Trained solely on RGBD human video datasets, our system achieves an average success rate of 81% on 18 real-world robot manipulation tasks, highlighting a pathway for scalable robot learning.
  • Figure 2: The framework of our prediction model. We build pipelines to extract general flow labels from RGBD human video datasets. Then, multiple design elements are utilized to enhance the scale-awareness and robustness of the prediction model.
  • Figure 3: Design architecture of our model. The model employs a CLIP encoder to convert instructions into semantic features and utilizes a PointNeXt backbone along with a conditional VAE to capture the multimodality of different action trajectories.
  • Figure 4: Visualization of general flow prediction. Trained only on human video datasets, our model robustly predicts general flow in zero-shot robot deployment scenes.
  • Figure 5: We achieve stable zero-shot human-to-robot skill transfer in the real world, encompassing 18 tasks with rigid, articulated, and soft objects across 6 scenes.
  • ...and 5 more figures