Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints

Chenyangguang Zhang, Botao Ye, Boqi Chen, Alexandros Delitzas, Fangjinhua Wang, Marc Pollefeys, Xi Wang

Abstract

Motion-controllable video generation is crucial for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent fine-grained hand articulation. By relying on 2D trajectories or implicit poses, they collapse 3D geometry into spatially ambiguous signals or over-rely on human-centric priors. Under severe egocentric occlusions, this causes motion inconsistencies and hallucinated artifacts and prevents cross-embodiment generalization to robotic hands. To address these limitations, we propose a novel framework that generates egocentric videos from a single reference frame, leveraging sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structures. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically, it extracts occlusion-aware features from the source reference frame by penalizing unreliable visual signals from hidden joints, and employs a 3D-based weighting mechanism to robustly handle dynamically occluded target joints during motion propagation. Concurrently, the module directly injects 3D geometric embeddings into the latent space to strictly enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline that yields over one million high-quality egocentric video clips paired with precise hand trajectories. Additionally, we register humanoid kinematic and camera data to construct a cross-embodiment benchmark. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and exhibiting exceptional cross-embodiment generalization to robotic hands.

Paper Structure (15 sections, 14 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Comparison with existing controllable video generation methods. We leverage sparse 3D hand joints as a cross-embodiment control signal, enabled by our occlusion-aware conditioning and 3D geometric embeddings. Consequently, our method generates high-fidelity consistent hands that align with complex input motions under severe occlusion. In contrast, pose-based methods struggle to generalize across diverse embodiments, and track-based methods exhibit weak spatial control.
  • Figure 2: Method overview. Our framework uses sparse 3D hand joints to represent motions by constructing two embedding streams. The occlusion-aware motion feature is obtained by first penalizing occluded regions to extract reliable context from the source frame, and then propagating that context with modulating 3D-aware feature weights to handle target occlusion. The 3D geometric embedding is formed by processing this motion feature, together with 3D joint coordinates and semantic embeddings, through a Causal Conv3D block. Finally, both embeddings are concatenated with the noisy latent and fed into LoRA-adapted DiT blocks.
  • Figure 3: Qualitative results of our data annotations. The last two images are zoomed in for clear visualization of hand tracking accuracy.
  • Figure 4: The user study win rates.
  • Figure 5: Qualitative comparisons. Compared with state-of-the-art WAN-Fun [Wan2_1_Fun_Control2025] and WAN-Move$^*$ [chu2025wan], our method shows better video quality with accurate hand control.
  • ...and 3 more figures
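
The Figure 2 caption describes down-weighting unreliable signals from occluded joints. As a rough illustration of that general idea only, the sketch below assigns each sparse 3D joint a soft visibility weight by comparing its depth against a depth map. This is not the paper's module, which operates on learned features; the depth-map test, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the `sharpness` and `tolerance` parameters are all assumptions introduced for this example.

```python
import math

def project_joint(joint_xyz, fx, fy, cx, cy):
    """Pinhole projection of a camera-space 3D joint to pixel coordinates."""
    x, y, z = joint_xyz
    return fx * x / z + cx, fy * y / z + cy

def visibility_weight(joint_depth, surface_depth, sharpness=50.0, tolerance=0.02):
    """Soft visibility in (0, 1): close to 1 when the joint is on or in front
    of the observed surface, close to 0 when it lies behind it (occluded).
    Both hyperparameters are hypothetical choices for this sketch."""
    gap = joint_depth - surface_depth - tolerance
    # Clamp the exponent so math.exp cannot overflow for large gaps.
    exponent = max(-60.0, min(60.0, sharpness * gap))
    return 1.0 / (1.0 + math.exp(exponent))

def occlusion_aware_weights(joints_xyz, depth_map, fx, fy, cx, cy):
    """Return one weight per joint; joints behind the camera or projecting
    outside the image receive weight 0."""
    height, width = len(depth_map), len(depth_map[0])
    weights = []
    for joint in joints_xyz:
        if joint[2] <= 0:
            weights.append(0.0)
            continue
        u, v = project_joint(joint, fx, fy, cx, cy)
        col, row = int(round(u)), int(round(v))
        if not (0 <= row < height and 0 <= col < width):
            weights.append(0.0)
            continue
        weights.append(visibility_weight(joint[2], depth_map[row][col]))
    return weights

# Toy scene: a flat surface 1 m from the camera, seen in a 4x4 depth map.
depth_map = [[1.0] * 4 for _ in range(4)]
joints = [(0.0, 0.0, 0.5), (0.0, 0.0, 1.5), (5.0, 0.0, 1.0)]
weights = occlusion_aware_weights(joints, depth_map, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

In the toy scene, the first joint sits in front of the surface, the second sits behind it, and the third projects outside the image, so only the first receives a weight near 1. Such per-joint weights could then scale whatever features are sampled at each joint location.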