Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis

Dayou Li, Lulin Liu, Bangya Liu, Shijie Zhou, Jiu Feng, Ziqi Lu, Minghui Zheng, Chenyu You, Zhiwen Fan

Abstract

To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human-Object Interaction (HOI) world models are critical for predicting physically grounded first-person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high-DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact-consistent interactions from action signals alone. To ensure physical accuracy without future-state inputs, EgoHOI distills geometric and kinematic priors from 3D estimates into physics-informed embeddings. These embeddings regularize the egocentric rollouts toward physically valid dynamics. Experiments on the HOT3D dataset demonstrate consistent gains over strong baselines, and ablations validate the effectiveness of our physics-informed design.

Paper Structure (59 sections, 22 equations, 11 figures, 8 tables)


Figures (11)

  • Figure 1: Qualitative comparison between a baseline world model and EgoHOI. Starting from the same first frame, EgoHOI integrates physics-informed embeddings to model hand–object interaction dynamics, improving ego-motion consistency, kinematic fidelity, and object integrity over time. Zoom-ins highlight clearer contact details and reduced drift.
  • Figure 2: Overview of the EgoHOI pipeline. We formulate EgoHOI as an egocentric world model that represents frames with a latent internal state and predicts action-driven transitions with a DiT backbone. Physics-informed embeddings distilled from reconstruction-based 3D priors, together with the first-frame object appearance, are integrated into the latent dynamics via lightweight adapters, enabling realistic hand–object interactions, geometry-consistent ego-motion, and stable object identity under viewpoint changes.
  • Figure 3: Qualitative comparison with baselines. Columns from left to right show the ground truth (GT), Wan, Cosmos-2B, Cosmos-14B, Uni3C, and our EgoHOI model. All methods receive the same first frame as input; compared with the baselines, our model better preserves hand and object geometry, maintains object identity, and produces more stable interaction dynamics over time. See the zoomed-in boxes for a comparison of fine-grained details.
  • Figure 4: Qualitative results for ablation on kinematic fidelity. Rows: GT, Base, HKE Injection, Ours (EgoHOI). HKE improves hand articulation stability and reduces local shape distortion. Zoom-in boxes highlight clearer hand–object contact details. The full model is the most stable, consistent with the MR/MPJPE/RMSE results.
  • Figure 5: Qualitative results for ablation on ego-motion consistency. Rows: GT, Base, EME Injection, Ours (EgoHOI). EME mitigates long-horizon ego-motion drift and improves viewpoint stability. Zoom-ins highlight cleaner details under viewpoint changes. The full model best matches ATE/RRE/RPE gains.
  • ...and 6 more figures
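
The ablation captions above cite MPJPE (for hand kinematics) and ATE (for ego-motion) without defining them. As a rough illustration only, the sketch below uses the commonly used definitions: MPJPE as the mean Euclidean distance between predicted and ground-truth joints, and ATE as the RMSE of camera-position error. The function names `mpjpe` and `ate_rmse` are hypothetical, and the paper's exact protocol (joint set, units, trajectory alignment before ATE) is not stated here and may differ.

```python
import math
from typing import Sequence

Point = Sequence[float]  # one 3D point, e.g. (x, y, z)


def mpjpe(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean per-joint position error: average Euclidean distance per joint."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def ate_rmse(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """RMSE of camera-position error over a trajectory.

    Assumption: both trajectories are already expressed in the same frame.
    Standard ATE usually aligns them first; that step is omitted here.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    squared = [math.dist(p, g) ** 2 for p, g in zip(pred, gt)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data (not from the paper): two joints, one of them off by 3 units.
pred_joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 3.0)]
print(mpjpe(pred_joints, gt_joints))  # 1.5
```

Lower values are better for both quantities. RRE and RPE, also named in the captions, measure rotation and relative-pose error and are not covered by this sketch.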