EgoSim: Egocentric World Simulator for Embodied Interaction Generation

Jinkun Hao, Mingda Jia, Ruiyan Wang, Xihui Liu, Ran Yi, Lizhuang Ma, Jiangmiao Pang, Xudong Xu

Abstract

We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage interactions. EgoSim addresses both limitations by modeling 3D scenes as updatable world states. Embodied interactions are generated by a Geometry-action-aware Observation Simulation model, while an Interaction-aware State Updating module maintains spatial consistency. To overcome the critical data bottleneck posed by the difficulty of acquiring densely aligned scene-interaction training pairs, we design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from large-scale, in-the-wild monocular egocentric videos. We further introduce EgoCap, a capture system that enables low-cost real-world data collection with uncalibrated smartphones. Extensive experiments demonstrate that EgoSim significantly outperforms existing methods in visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while supporting cross-embodiment transfer to robotic manipulation. Code and datasets will be released soon. The project page is at egosimulator.github.io.
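
At a high level, the closed loop described above alternates between rendering an action-conditioned observation from the current 3D scene state and writing the resulting interaction back into that state. The sketch below illustrates this loop only; the names (SceneState, simulate_observation, update_state) are hypothetical placeholders, not EgoSim's actual interfaces.

```python
# Minimal sketch of the closed simulation loop described in the abstract.
# All names here (SceneState, simulate_observation, update_state) are
# hypothetical placeholders, not EgoSim's actual API.
from dataclasses import dataclass
from typing import List


@dataclass
class SceneState:
    """Persistent 3D world state S_k (e.g. a point cloud plus object poses)."""
    step: int = 0


def simulate_observation(state: SceneState, action: str) -> str:
    """Stand-in for the Geometry-action-aware Observation Simulation model:
    produces an action-conditioned egocentric observation O_k from S_k."""
    return f"O_{state.step}: view after '{action}'"


def update_state(state: SceneState, observation: str) -> SceneState:
    """Stand-in for the Interaction-aware State Updating module:
    folds the generated observation back into the persistent 3D state."""
    return SceneState(step=state.step + 1)


def run_simulation(state: SceneState, actions: List[str]) -> List[str]:
    """Closed loop: each action yields an observation, which updates the
    world state used when simulating the next action."""
    observations = []
    for action in actions:
        obs = simulate_observation(state, action)
        observations.append(obs)
        state = update_state(state, obs)
    return observations


if __name__ == "__main__":
    print(run_simulation(SceneState(), ["pick up the mug", "place it on the shelf"]))
```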

Figures (19)

  • Figure 1: Given a scene image and a sequence of actions, EgoSim generates temporally and spatially consistent egocentric observations and high-quality dexterous interactions. EgoSim also persistently updates the 3D scene state for continuous simulation. We propose a data construction pipeline that leverages web-scale egocentric video data, strengthening the generalization ability of EgoSim with scalable scene-interaction pairs. EgoSim also exhibits strong few-shot adaptation capability to real-world scenarios and diverse robotic embodiments.
  • Figure 2: Overview of EgoSim. Our framework enables continuous egocentric simulation via updatable world modeling. It consists of (1) a Geometry-action-aware Observation Simulation model that synthesizes action-conditioned visual dynamics $O_k$ based on the current scene geometry; and (2) an Interaction-aware State Updating module that tracks point-based, interaction-aware object states from generated observations and integrates them back into the persistent 3D world state to update $S_k$.
  • Figure 3: Implementation details of Interaction-aware State Updating. State Reconstruction (left) builds a point cloud from observations via modified VIPE. Object State Update (top-right) identifies and tracks interactive objects with a VLM and SAM3, compositing their latest geometry into the scene. Incremental State Fusion (bottom) aligns and merges consecutive states via TSDF fusion; a minimal sketch of this fusion step follows the figure list.
  • Figure 4: Overview of our scalable data processing pipeline. For both human egocentric videos and robotic manipulation videos, we automate the extraction of aligned triplets: static 3D scene point clouds, precise camera trajectories, and dynamic interaction sequences represented as spatial action keypoints.
  • Figure 5: The EgoCap pipeline. An uncalibrated head-mounted smartphone first scans the scene to build a 3DGS map, then records the ego-view interaction. Relocalizing against the map yields viewpoint-aligned paired data.
  • ...and 14 more figures
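
The Incremental State Fusion step in Figure 3 merges consecutive scene states with TSDF fusion. Below is a minimal sketch of the standard weighted-average TSDF update over a voxel grid (Curless-Levoy style); the grid resolution, weighting scheme, and the fuse_tsdf helper are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of TSDF-style incremental state fusion (Figure 3, bottom).
# The voxel grid, weights, and this helper are illustrative assumptions,
# not EgoSim's actual implementation.
from typing import Tuple

import numpy as np


def fuse_tsdf(tsdf: np.ndarray, weights: np.ndarray,
              new_sdf: np.ndarray, new_weights: np.ndarray,
              trunc: float = 0.05) -> Tuple[np.ndarray, np.ndarray]:
    """Weighted running average per voxel: D <- (W*D + w*d) / (W + w), W <- W + w."""
    new_sdf = np.clip(new_sdf, -trunc, trunc)        # truncate the incoming SDF
    total = weights + new_weights
    fused = np.where(
        total > 0,
        (weights * tsdf + new_weights * new_sdf) / np.maximum(total, 1e-8),
        tsdf,                                        # unobserved voxels keep the old value
    )
    return fused, total


if __name__ == "__main__":
    grid = (32, 32, 32)
    tsdf_prev = np.full(grid, 0.05)                  # previous state S_{k-1}
    w_prev = np.zeros(grid)                          # no prior observations yet
    sdf_new = np.random.uniform(-0.1, 0.1, grid)     # SDF derived from the new observation O_k
    w_new = np.ones(grid)                            # observed voxels weighted 1
    tsdf_k, w_k = fuse_tsdf(tsdf_prev, w_prev, sdf_new, w_new)
    print(tsdf_k.shape, float(w_k.max()))
```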