EgoX: Egocentric Video Generation from a Single Exocentric Video
Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung, Jaegul Choo
TL;DR
EgoX tackles exocentric-to-egocentric video generation from a single input video. It lifts the exocentric scene into a 3D representation to produce an egocentric prior, then fuses this prior with the exocentric stream inside a pretrained video diffusion model using unified width-wise and channel-wise conditioning. A geometry-guided self-attention module biases attention toward geometrically corresponding regions, enabling coherent, high-fidelity egocentric outputs even under extreme viewpoint changes. Trained with lightweight LoRA adaptation, the approach generalizes well to unseen scenes and in-the-wild data, outperforms state-of-the-art baselines on both image/object and video metrics, and is supported by extensive ablations. One limitation is its reliance on egocentric pose inputs; future work plans automatic head-pose estimation to enable fully automatic generation.
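To make the idea of biasing attention toward geometrically corresponding regions concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each token carries a 3D direction vector and that the bias is the cosine similarity between query and key directions, scaled by a hypothetical `strength` parameter; the function name, the inputs `dirs_q` and `dirs_k`, and the exact form of the bias are all illustrative assumptions.

```python
import numpy as np

def geometry_biased_attention(q, k, v, dirs_q, dirs_k, strength=1.0):
    """Single-head attention with an additive geometric bias (illustrative).

    q: (Nq, d), k: (Nk, d), v: (Nk, dv) token features.
    dirs_q: (Nq, 3), dirs_k: (Nk, 3) per-token 3D directions (assumed inputs).
    strength: hypothetical scale of the geometric bias.
    """
    d = q.shape[-1]
    # Standard scaled dot-product attention logits.
    logits = q @ k.T / np.sqrt(d)
    # Normalize directions so their dot product is a cosine similarity.
    dq = dirs_q / np.linalg.norm(dirs_q, axis=-1, keepdims=True)
    dk = dirs_k / np.linalg.norm(dirs_k, axis=-1, keepdims=True)
    # Assumption: geometrically aligned token pairs receive a larger logit.
    logits = logits + strength * (dq @ dk.T)
    # Numerically stable softmax over the keys.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

With `strength=0` this reduces to ordinary attention; larger values shift attention mass toward keys whose direction aligns with the query's.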
Abstract
Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width-wise and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.
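The width-wise and channel-wise concatenation can be pictured with a small shape-level sketch. This is an illustration under stated assumptions, not the paper's code: latents are assumed to have layout (C, T, H, W), the egocentric prior is assumed to be stacked channel-wise onto the noisy egocentric latent, the exocentric latent is assumed to sit beside it along the width axis, and the zero-padding of the exocentric channels is a hypothetical choice made only so the two halves have matching channel counts.

```python
import numpy as np

def build_conditioned_latent(noisy_ego, ego_prior, exo):
    """Combine three latents of identical shape (C, T, H, W) (illustrative).

    Returns an array of shape (2C, T, H, 2W).
    """
    # Channel-wise: stack the egocentric prior onto the noisy egocentric latent.
    ego_half = np.concatenate([noisy_ego, ego_prior], axis=0)
    # Assumption: pad the exocentric latent with zero channels to match 2C.
    exo_half = np.concatenate([exo, np.zeros_like(exo)], axis=0)
    # Width-wise: place the exocentric half to the left of the egocentric half.
    return np.concatenate([exo_half, ego_half], axis=3)
```

For example, three latents of shape (4, 3, 8, 8) yield a conditioned input of shape (8, 3, 8, 16), which a diffusion backbone could process as a single wide sequence so that egocentric tokens can attend to exocentric ones.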
