PlayerOne: Egocentric World Simulator
Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, Hengshuang Zhao
TL;DR
PlayerOne is an egocentric world simulator that generates long, unrestricted videos aligned with real user motion by combining a diffusion transformer with part-disentangled motion injection and joint 4D scene–frame reconstruction. It uses a coarse-to-fine training strategy and an automatic dataset construction pipeline to bridge the gap between real-world egocentric data and synthetic video generation, and reaches real-time performance through distillation. The approach generalizes well across diverse scenes and motions and outperforms existing world-modeling baselines on motion alignment and video fidelity, as supported by ablations and user studies. This work opens new avenues for interactive, realistic world modeling in robotics, simulation, and gaming.
Abstract
We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the user's real-scene human motion, as captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Moreover, considering the varying importance of different components, we design a part-disentangled motion injection scheme that enables precise control of part-level movements. We also devise a joint reconstruction framework that progressively models both the 4D scene and the video frames, ensuring scene consistency in long-form video generation. Experimental results demonstrate strong generalization in the precise control of varying human movements and in the world-consistent modeling of diverse scenarios. PlayerOne marks the first endeavor into egocentric real-world simulation and can pave the way for the community to explore new frontiers of world modeling and its diverse applications.
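To make the idea of part-disentangled motion injection more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a per-frame motion vector that can be split into parts, encodes each part with its own encoder, and concatenates the results into one conditioning vector. The part names, vector layout, embedding size, and the random linear maps standing in for learned encoders are all hypothetical.

```python
import random

# Hypothetical layout of a 15-value per-frame motion vector.
# The part grouping and sizes are illustrative assumptions only.
PART_SLICES = {
    "head": slice(0, 3),
    "hands": slice(3, 9),
    "body_feet": slice(9, 15),
}
EMBED_DIM = 4  # assumed embedding size per part


def make_encoder(in_dim, out_dim, rng):
    """Return a random linear map standing in for a learned per-part encoder."""
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(values):
        return [sum(w * v for w, v in zip(row, values)) for row in weights]

    return encode


def build_encoders(seed=0):
    """Create one independent encoder per body part."""
    rng = random.Random(seed)
    return {
        name: make_encoder(s.stop - s.start, EMBED_DIM, rng)
        for name, s in PART_SLICES.items()
    }


def encode_motion(frame, encoders):
    """Encode each part separately, then concatenate into one conditioning vector."""
    conditioning = []
    for name, part_slice in PART_SLICES.items():
        conditioning.extend(encoders[name](frame[part_slice]))
    return conditioning


if __name__ == "__main__":
    encoders = build_encoders(seed=0)
    frame = [0.1] * 15
    print(len(encode_motion(frame, encoders)))
```

Because each part has its own encoder, changing only the head values changes only the head segment of the conditioning vector, which is the property that lets part-level movements be controlled independently in this sketch.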
