PlayerOne: Egocentric World Simulator

Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, Hengshuang Zhao

TL;DR

PlayerOne introduces a pioneering egocentric world simulator that generates long, unrestricted videos aligned with real user motion by combining a diffusion transformer with part-disentangled motion injection and a joint 4D scene–frame reconstruction. It leverages a coarse-to-fine training strategy and an automatic dataset construction pipeline to bridge the gap between real-world egocentric data and synthetic video generation, achieving real-time performance through distillation. The approach demonstrates strong generalization across diverse scenes and motions, outperforming existing world-modeling baselines on motion alignment and video fidelity, with comprehensive ablations and user studies. This work opens new avenues for interactive, realistic world modeling in robotics, simulation, and gaming contexts.

Abstract

We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the user's real-world motion as captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. In addition, because different motion components vary in importance, we design a part-disentangled motion injection scheme that enables precise control of part-level movements. We also devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in long-form video generation. Experimental results demonstrate strong generalization in precise control of varying human movements and world-consistent modeling of diverse scenarios. PlayerOne marks the first endeavor into egocentric real-world simulation and can pave the way for the community to explore new frontiers of world modeling and its diverse applications.

Paper Structure

This paper contains 12 sections, 4 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Simulated videos of our PlayerOne. Given an egocentric image as the scene to be explored, we can simulate egocentric immersive videos that are accurately aligned with the user's motion sequence captured by an exocentric camera. All the users have been anonymized and action videos are shot with the front camera.
  • Figure 2: Overall framework of our PlayerOne. It begins by converting the egocentric first frame into visual tokens. The human motion sequence is split into groups and fed into the motion encoders respectively to generate part-wise motion latents, with the head parameters converted into a rotation-only camera sequence. This camera sequence is then encoded via a camera encoder, and its output is injected into noised video latents to improve view-change alignment. Next, we render a 4D scene point map sequence with the ground truth video, which is then processed by a point map encoder with an adapter to produce scene latents. Then we input the concatenation of these latents into the DiT Model and perform noising and denoising on both the video and scene latents to ensure world-consistent generation. Finally, the denoised latents are decoded by VAE decoders to produce the final results. Note that only the first frame and the human motion sequence are needed for inference.
  • Figure 3: The overall pipeline of the dataset construction. By seamlessly integrating detection and human pose estimation models, we can extract motion-video pairs from existing egocentric-exocentric video datasets while retaining high-quality data through our automatic filtering scheme.
  • Figure 4: Investigation on coarse-to-fine training. "Joint-Train" denotes one-stage training on both motion-video pairs and large-scale egocentric videos; "No Pretrain" denotes training on motion-video pairs only. Wanx2.1 1.3B is adopted as the baseline.
  • Figure 5: Investigation on part-disentangled motion injection. "ControlNet" denotes injecting motion latents with a ControlNet [zhang2023adding]. "Entangled" denotes feeding the whole motion sequence into a single motion encoder without dividing it into groups; "No Cam" denotes removing the camera encoder.
  • ...and 3 more figures
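
The Figure 2 caption describes two preprocessing steps that a short sketch may make more concrete: splitting the motion sequence into part groups, and turning the head parameters into a rotation-only camera sequence. The Python below is only an illustration under stated assumptions, not the authors' implementation. The joint layout, the group names and indices in PART_GROUPS, the axis-angle input format, and the function names are all hypothetical; the summary does not specify any of them.

```python
import math

# Hypothetical joint-index groups. The actual grouping and joint layout
# used by PlayerOne are not given in this summary.
PART_GROUPS = {
    "head": [0],
    "hands": [1, 2],
    "body_feet": [3, 4, 5],
}


def split_motion(frames, groups=PART_GROUPS):
    """Split per-frame joint rotations into one sequence per part group.

    frames: list of frames, each a list of per-joint axis-angle triples.
    Returns {group_name: list of frames restricted to that group's joints}.
    """
    return {
        name: [[frame[j] for j in idx] for frame in frames]
        for name, idx in groups.items()
    }


def axis_angle_to_matrix(v):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    x, y, z = v
    theta = math.sqrt(x * x + y * y + z * z)
    if theta < 1e-8:  # negligible rotation: return the identity
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = x / theta, y / theta, z / theta
    c, s = math.cos(theta), math.sin(theta)
    t = 1.0 - c
    return [
        [c + t * x * x, t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, c + t * y * y, t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, c + t * z * z],
    ]


def head_to_camera_sequence(head_frames):
    """Build rotation-only 4x4 camera matrices from head rotations.

    Assumes (hypothetically) that the first joint of the head group
    carries the head orientation. Translation is fixed to zero.
    """
    cams = []
    for frame in head_frames:
        r = axis_angle_to_matrix(frame[0])
        cams.append([row + [0.0] for row in r] + [[0.0, 0.0, 0.0, 1.0]])
    return cams
```

In this sketch, each entry of the dictionary returned by split_motion would go to its own motion encoder, while the matrices from head_to_camera_sequence would go to the camera encoder. Fixing the translation column to zero is what makes the camera sequence rotation-only.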