EgoAVFlow: Robot Policy Learning with Active Vision from Human Egocentric Videos via 3D Flow

Daesol Cho; Youngseok Jang; Danfei Xu; Sehoon Ha

TL;DR

EgoAVFlow learns manipulation and active vision from egocentric human videos through a shared 3D flow representation that supports geometric visibility reasoning and transfers to robots without robot demonstrations.

Abstract

Egocentric human videos provide a scalable source of manipulation demonstrations; however, deploying them on robots requires active viewpoint control to maintain task-critical visibility, which human viewpoint imitation often fails to provide due to human-specific priors. We propose EgoAVFlow, which learns manipulation and active vision from egocentric videos through a shared 3D flow representation that supports geometric visibility reasoning and transfers without robot demonstrations. EgoAVFlow uses diffusion models to predict robot actions, future 3D flow, and camera trajectories, and refines viewpoints at test time with reward-maximizing denoising under a visibility-aware reward computed from predicted motion and scene geometry. Real-world experiments under actively changing viewpoints show that EgoAVFlow consistently outperforms prior human-demo-based baselines, demonstrating effective visibility maintenance and robust manipulation without robot demonstrations.
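The reward-maximizing denoising mentioned in the abstract can be illustrated with a toy sketch, assuming a generic particle scheme: at each reverse step, candidate samples are denoised, weighted by an exponentiated reward, and resampled. Everything below (`resample_denoise`, `toy_denoise_step`, `toy_reward`, the temperature, and the particle count) is a hypothetical stand-in chosen for illustration; it is not the paper's diffusion models, its visibility-aware reward, or its exact algorithm.

```python
import numpy as np


def resample_denoise(denoise_step, reward, x, n_steps, temperature, rng):
    """Toy reward-guided sampling via per-step importance resampling.

    x            : (K, D) array of K candidate samples (particles).
    denoise_step : hypothetical stand-in for one reverse diffusion step,
                   called as denoise_step(x, t, rng) -> (K, D).
    reward       : hypothetical stand-in score, reward(x) -> (K,).
    temperature  : assumed scale; smaller values follow the reward more greedily.
    """
    for t in reversed(range(n_steps)):
        x = denoise_step(x, t, rng)
        # Importance weights proportional to exp(reward / temperature),
        # shifted by the maximum for numerical stability.
        logw = reward(x) / temperature
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Resample particles in proportion to their weights.
        idx = rng.choice(len(x), size=len(x), p=w)
        x = x[idx]
    return x


def toy_denoise_step(x, t, rng):
    # Assumption: a contraction toward the origin plus noise, not a learned model.
    return 0.9 * x + 0.3 * rng.standard_normal(x.shape)


def toy_reward(x):
    # Assumption: reward peaks at the point (1, 1); stands in for any scalar score.
    return -np.sum((x - 1.0) ** 2, axis=1)


x0 = np.random.default_rng(0).standard_normal((512, 2))
guided = resample_denoise(toy_denoise_step, toy_reward, x0, 20, 0.5,
                          np.random.default_rng(1))
unguided = resample_denoise(toy_denoise_step, lambda x: np.zeros(len(x)), x0, 20, 0.5,
                            np.random.default_rng(1))
```

With a constant reward the weights are uniform, so `unguided` serves as a baseline against which the reward-tilted `guided` samples can be compared.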

Paper Structure (29 sections, 14 equations, 7 figures, 1 table, 1 algorithm)

Introduction
Related Works
Preliminary
Data pre-processing
Robot data from egocentric human video
Scene description via 3D flow
Marker coordinate representation
Soft Value-Based Denoising for Reward Maximizing Diffusion
Reward-tilted target distribution
Soft value as a look-ahead score
Value-weighted denoising process
Denoising via per-step importance resampling
EgoAVFlow: Policy learning from egocentric human videos with active vision via a 3D flow
Overall Framework
Robot policy $\pi_r$
...and 14 more sections

Figures (7)

Figure 1: EgoAVFlow learns manipulation and active viewpoint control from egocentric human videos by predicting future 3D flow and optimizing camera viewpoints for visibility, yielding viewpoint-robust robot execution without robot demonstrations.
Figure 2: Method overview. EgoAVFlow consists of three diffusion models. The robot policy $\pi_r$ produces future robot action sequences. The flow generation model $f$ predicts future 3D flows from the outputs of $\pi_r$. The view policy $\pi_v$ produces future camera viewpoints from the outputs of $\pi_r$, $f$, and reconstructed mesh surfaces through a visibility-aware reward-maximizing denoising process. In viewpoints (A), most query points are invisible (red LOS) because they are blocked by the table's mesh surface or lie outside the FoV, whereas in viewpoints (B) these points are visible (green LOS), yielding a higher visibility reward.
Figure 3: Tasks. Each task requires appropriate viewpoint adjustments; otherwise, the object is occluded by the robot or by elements of the environment, such as a table or drawer.
Figure 4: Visibility comparison (best viewed in the digital version). Visibility is computed separately for each of several fixed viewpoints. No single viewpoint maintains full visibility throughout execution, indicating that the viewpoint must be continuously adjusted online to maximize visibility.
Figure 5: Visibility reward. For all tasks, EgoAVFlow achieves higher average visibility rewards $R_{vis}$ than HVI, demonstrating our method's visibility maintenance capability. The error bars represent 1 standard error.
...and 2 more figures
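The visibility test described in the Figure 2 caption (a query point counts as visible only if it lies inside the FoV and its line of sight is not blocked by a surface) can be sketched as follows. This is a minimal illustration under loud assumptions: the FoV is modeled as a cone, the reconstructed mesh is replaced by a single axis-aligned horizontal rectangle standing in for a table top, and the score is simply the fraction of visible points. The function `visibility_fraction` and all of its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def visibility_fraction(cam_pos, cam_forward, half_fov_rad, points,
                        table_z, table_min_xy, table_max_xy):
    """Fraction of query points inside a conical FoV with an unblocked line of sight.

    Assumption: the only occluder is the rectangle
    {z = table_z, table_min_xy <= (x, y) <= table_max_xy}.
    """
    cam_pos = np.asarray(cam_pos, dtype=float)
    fwd = np.asarray(cam_forward, dtype=float)
    fwd = fwd / np.linalg.norm(fwd)
    pts = np.asarray(points, dtype=float)

    rays = pts - cam_pos                         # (N, 3) camera-to-point vectors
    dist = np.linalg.norm(rays, axis=1)
    cos_angle = (rays @ fwd) / np.maximum(dist, 1e-9)
    in_fov = cos_angle >= np.cos(half_fov_rad)

    # Intersect each camera-to-point segment with the plane z = table_z.
    dz = rays[:, 2]
    crosses_plane = np.abs(dz) > 1e-9
    safe_dz = np.where(crosses_plane, dz, 1.0)   # avoid division by zero
    s = (table_z - cam_pos[2]) / safe_dz         # segment parameter; 0 = camera, 1 = point
    between = crosses_plane & (s > 1e-6) & (s < 1.0 - 1e-6)
    hit_xy = cam_pos[:2] + s[:, None] * rays[:, :2]
    inside = np.all((hit_xy >= np.asarray(table_min_xy)) &
                    (hit_xy <= np.asarray(table_max_xy)), axis=1)
    occluded = between & inside

    return float(np.mean(in_fov & ~occluded))


# Two query points just below the table edge, e.g. inside a drawer.
query = [(0.9, 0.0, -0.1), (0.8, 0.0, -0.1)]
table = dict(table_z=0.0, table_min_xy=(-1.0, -1.0), table_max_xy=(1.0, 1.0))

# Viewpoint A: directly above, looking down through the table top.
vis_a = visibility_fraction((0.85, 0.0, 1.0), (0.0, 0.0, -1.0), np.pi / 4, query, **table)
# Viewpoint B: off to the side, looking past the table edge.
vis_b = visibility_fraction((2.0, 0.0, 0.3), (-1.15, 0.0, -0.4), np.pi / 4, query, **table)
```

A score of this kind could serve as the reward in a viewpoint search, with the caveat that a real system would test lines of sight against a full mesh and use the camera's actual frustum rather than a cone.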
