Flow-Enabled Generalization to Human Demonstrations in Few-Shot Imitation Learning

Runze Tang; Penny Sweetser

TL;DR

Imitation learning in robotics often demands extensive demonstrations. The paper introduces SFCrP, a two-component framework combining SFCr (cross-embodiment scene flow predictor) and FCrP (flow- and cropped point cloud-conditioned policy) to leverage human demonstrations for robust robot manipulation. SFCr predicts trajectories for arbitrary scene points using a Transformer-based flow predictor and segmentation to bridge human and robot embodiments, while FCrP uses a diffusion-based policy conditioned on raw flow and a localized cropped point cloud with a flow-state–action alignment to enable precise actions and prevent overfitting. Real-world experiments show improved flow prediction accuracy and higher task success rates than state-of-the-art baselines, including strong generalization to tasks seen only in human videos, demonstrating a practical approach to reducing IL data requirements and enhancing cross-embodiment generalization.

Abstract

Imitation Learning (IL) enables robots to learn complex skills from demonstrations without explicit task modeling, but it typically requires a large number of demonstrations, which makes data collection costly. Prior work has investigated using flow as an intermediate representation so that human videos can substitute for robot demonstrations, thereby reducing the number of robot demonstrations required. However, most prior work has focused on flow either on the object or on specific points of the robot or hand, which cannot describe the motion of the interaction. Meanwhile, relying on flow to generalize to scenarios observed only in human videos remains limited, as flow alone cannot capture precise motion details. Furthermore, conditioning on scene observations to produce precise actions may cause the flow-conditioned policy to overfit to the training tasks and weaken the generalization indicated by the flow. To address these gaps, we propose SFCrP, which includes a Scene Flow prediction model for Cross-embodiment learning (SFCr) and a Flow and Cropped point cloud conditioned Policy (FCrP). SFCr learns from both robot and human videos and predicts trajectories for arbitrary points in the scene. FCrP follows the general flow motion and adjusts the action based on observations for precision tasks. Our method outperforms state-of-the-art (SOTA) baselines across various real-world task settings, while also exhibiting strong spatial and instance generalization to scenarios seen only in human videos.
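As a rough illustration of the "cropped point cloud" input that FCrP is conditioned on, the sketch below keeps only the points near a reference position. The abstract does not specify how the crop is defined, so the spherical crop, the function name `crop_point_cloud`, and the `center` and `radius` parameters are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def crop_point_cloud(points, center, radius):
    """Keep the points within `radius` of `center` (hypothetical crop rule).

    points: (N, 3) array of xyz coordinates.
    center: length-3 reference position, e.g. a gripper position (assumed).
    radius: crop radius in the same units as `points`.
    """
    points = np.asarray(points, dtype=float)
    center = np.asarray(center, dtype=float)
    # Euclidean distance of every point to the reference position.
    distances = np.linalg.norm(points - center, axis=1)
    return points[distances <= radius]


# Toy scene: two points near the origin and one far away.
cloud = np.array([
    [0.00, 0.00, 0.00],
    [0.05, 0.00, 0.00],
    [0.50, 0.50, 0.50],
])
local = crop_point_cloud(cloud, center=[0.0, 0.0, 0.0], radius=0.1)
```

Restricting the observation to a local region in this way is one plausible reading of how a policy could see enough detail for precise actions without conditioning on the whole scene.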

Paper Structure (11 sections, 4 figures, 4 tables)

Introduction
Related Work
Approach
SFCr: Cross-Embodiment Scene Flow Prediction Model
FCrP: Flow and Cropped Point Cloud Conditioned Policy
Real-World Evaluation
Flow Prediction Evaluation
Real-World Robot Manipulation
Failure Mode Analysis
Discussion
Conclusion

Figures (4)

Figure 1: Overview of our work. Left: We use 30 human videos and 10 robot demonstrations to train the cross-embodiment flow prediction model SFCr and the flow-conditioned policy FCrP. Right: The point cloud observation from a single third-person-view camera and the predicted flow during execution. The images on the right are the beginning and success states of each task.
Figure 2: The bowl positions (warm-colored rectangles) and instances in the Pick Bowl tasks. There are no robot demonstrations for #4-6.
Figure 3: Logarithmic scale flow ADE and FDE over five seeds.
Figure 4: Max-pooling referenced points (red) in Open Drawer.
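Figure 3 reports flow ADE and FDE. Assuming these follow the usual trajectory-prediction definitions (average displacement error over all time steps, and final displacement error at the last time step), they could be computed roughly as below. The array layout `(N, T, 3)` and the function name `flow_ade_fde` are assumptions for illustration; the page does not state how the authors compute or aggregate these metrics.

```python
import numpy as np


def flow_ade_fde(pred, gt):
    """Return (ADE, FDE) for predicted vs. ground-truth point trajectories.

    pred, gt: arrays of shape (N, T, 3), i.e. N tracked points,
    T time steps, xyz coordinates (assumed layout).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    # Per-point, per-step Euclidean error, shape (N, T).
    dist = np.linalg.norm(pred - gt, axis=-1)
    ade = dist.mean()          # average over all points and time steps
    fde = dist[:, -1].mean()   # average over points at the final step only
    return ade, fde


# Toy example: the x error grows by one unit per time step.
gt = np.zeros((2, 3, 3))
pred = gt.copy()
pred[:, :, 0] = [0.0, 1.0, 2.0]
ade, fde = flow_ade_fde(pred, gt)
```

In this toy case the per-step errors are 0, 1, and 2 for each point, so the average error is 1 and the final-step error is 2.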
