WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos

Yufei Ye, Jiaman Li, Ryan Rong, C. Karen Liu

TL;DR

WHOLE holistically reconstructs hand and object motion in world space from egocentric videos, given object templates, by learning a generative prior over hand-object motion that jointly reasons about their interactions.

Abstract

Egocentric manipulation videos are highly challenging due to severe occlusions during interactions and frequent object entries and exits from the camera view as the person moves. Current methods typically focus on recovering either hand or object pose in isolation, but both struggle during interactions and fail to handle out-of-sight cases. Moreover, their independent predictions often lead to inconsistent hand-object relations. We introduce WHOLE, a method that holistically reconstructs hand and object motion in world space from egocentric videos given object templates. Our key insight is to learn a generative prior over hand-object motion to jointly reason about their interactions. At test time, the pretrained prior is guided to generate trajectories that conform to the video observations. This joint generative reconstruction substantially outperforms approaches that process hands and objects separately followed by post-processing. WHOLE achieves state-of-the-art performance on hand motion estimation, 6D object pose estimation, and their relative interaction reconstruction. Project website: https://judyye.github.io/whole-www

Paper Structure (40 sections, 2 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Given a metric-SLAMed egocentric video of a person interacting with the scene and the corresponding object templates, WHOLE reconstructs the motions of both hands and the objects of interest. The reconstruction is shown from the egocentric camera view, an allocentric view, and through the full 4D hand-object trajectories. In the example above, the person moves a box from the table on the left to the shelf on the right, takes a can from the shelf and places it on the center table, and finally picks up an orange box. For visual clarity, multiple objects are overlaid into a single scene, and the hands are displayed only when interacting with the box.
  • Figure 2: Reconstruction Using the Generative Motion Prior. Given a metric-SLAMed egocentric video and the object template $\bm O$, we alternate between a diffusion generation step and a guidance step to predict the hand motion $\bm H$, the object 6D trajectory $\bm T$, and the binary contact $\bm C$ as the final output $\bm x_0$. The diffusion model $D_\psi$ is conditioned on the object geometry and on an approximate hand motion $\bar{\bm H}$ from an off-the-shelf hand estimator to denoise the noisy parameters $\bm x_n$. The guidance step refines the denoised output by optimizing task-specific objectives $g$ so that it is consistent with the video observations $\hat{\bm y}$, such as 2D masks and contact. The contact labels $\hat{\bm C}$ are obtained automatically by prompting a VLM.
  • Figure 3: Visual Prompt: We show two examples of the visual prompts provided to the VLM for contact detection.
  • Figure 4: HOI Generation Samples: We show two interaction samples generated by our diffusion model under the same conditions. Objects are colored red when contact is predicted and blue otherwise. We show 6 key frames from a blended 150-frame generation.
  • Figure 5: HOI Visualization. We show hand-object reconstructions from GT (green), FP+HaWor-Simple (purple), FP+HaWor-Contact (pink), and WHOLE (blue). Red circles highlight floating objects. We encourage readers to see the videos in the Sup. Mat.
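
The alternation described in the Figure 2 caption (a denoising step followed by a guidance step that pulls the estimate toward the observations) can be illustrated with a toy NumPy sketch. This is not the authors' implementation. The names `toy_denoiser`, `guidance_step`, and `guided_sample`, the quadratic objective on visible frames, and the noise schedule `sigmas` are all assumptions made for illustration. A single array stands in for the full state ($\bm H$, $\bm T$, $\bm C$), and a simple blend stands in for the learned model $D_\psi$.

```python
import numpy as np


def toy_denoiser(x_n, h_bar, sigma):
    """Hypothetical stand-in for D_psi: predict a clean sample from x_n.

    The noisier the input, the more the prediction leans on the
    conditioning signal h_bar (the approximate hand estimate).
    """
    w = sigma / (1.0 + sigma)
    return (1.0 - w) * x_n + w * h_bar


def guidance_step(x0, y_obs, visible, step_size=0.5, n_iters=5):
    """Refine x0 by gradient descent on an assumed objective
    g(x0) = 0.5 * sum over visible frames of ||x0 - y_obs||^2.
    Frames where the target is out of view receive no gradient.
    """
    x0 = x0.copy()
    for _ in range(n_iters):
        grad = np.where(visible[:, None], x0 - y_obs, 0.0)
        x0 -= step_size * grad
    return x0


def guided_sample(h_bar, y_obs, visible, sigmas, seed=0):
    """Alternate denoising and guidance over a decreasing noise schedule."""
    rng = np.random.default_rng(seed)
    x = sigmas[0] * rng.standard_normal(h_bar.shape)  # start from noise
    for i, sigma in enumerate(sigmas):
        x0 = toy_denoiser(x, h_bar, sigma)         # generation step
        x0 = guidance_step(x0, y_obs, visible)     # guidance step
        sigma_next = sigmas[i + 1] if i + 1 < len(sigmas) else 0.0
        x = x0 + sigma_next * rng.standard_normal(x0.shape)  # re-noise
    return x
```

In this sketch, frames marked visible end up close to the observations, while frames marked not visible are driven only by the denoiser, which loosely mirrors how a prior can fill in out-of-sight segments.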