EgoX: Egocentric Video Generation from a Single Exocentric Video

Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung, Jaegul Choo

TL;DR

EgoX tackles exocentric-to-egocentric video generation from a single input video. It lifts the exocentric scene into a 3D representation and renders it from the egocentric viewpoint to produce an egocentric prior, then fuses this prior with the exocentric stream inside a pretrained video diffusion model using a unified width-wise and channel-wise conditioning scheme. A geometry-guided self-attention module biases attention toward geometrically corresponding regions, enabling coherent, high-fidelity egocentric outputs even under extreme viewpoint changes. Trained with lightweight LoRA adaptation, the approach generalizes to unseen scenes and in-the-wild data, outperforms state-of-the-art baselines on both image/object and video metrics, and is supported by extensive ablations. The main limitation is its reliance on egocentric pose inputs; automatic head-pose estimation is planned as future work to make generation fully automatic.

Abstract

Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width-wise and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.
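
To make the unified conditioning concrete, the following is a minimal NumPy sketch of one way width-wise and channel-wise concatenation could be combined in latent space. The text here does not specify which signal goes on which axis, so the layout is an assumption: the egocentric prior is stacked with the noisy egocentric latent along channels, and the clean exocentric latent is placed beside them along width. The zero-filled channels on the exocentric side, the function name `build_conditioned_latent`, and the `(C, T, H, W)` layout are all hypothetical choices made for illustration.

```python
import numpy as np

def build_conditioned_latent(noisy_ego, ego_prior, exo_clean):
    """Combine latents of shape (C, T, H, W) into one conditioned input.

    Assumed layout (not confirmed by the source):
      - channel-wise: [noisy_ego ; ego_prior]  -> (2C, T, H, W_ego)
      - width-wise:   [ego branch | exo branch] -> (2C, T, H, W_ego + W_exo)
    """
    if noisy_ego.shape != ego_prior.shape:
        raise ValueError("ego prior must match the noisy ego latent shape")
    if noisy_ego.shape[:3] != exo_clean.shape[:3]:
        raise ValueError("exo latent must share C, T, H with the ego latent")

    # Channel-wise concatenation: the prior rides along with the noisy latent.
    ego_branch = np.concatenate([noisy_ego, ego_prior], axis=0)

    # Assumption: pad the exo side with zero channels so both branches
    # have the same channel count before the width-wise concatenation.
    exo_branch = np.concatenate([exo_clean, np.zeros_like(exo_clean)], axis=0)

    # Width-wise concatenation: ego tokens and exo tokens share one sequence.
    return np.concatenate([ego_branch, exo_branch], axis=-1)


rng = np.random.default_rng(0)
noisy_ego = rng.normal(size=(4, 3, 8, 8))
ego_prior = rng.normal(size=(4, 3, 8, 8))
exo_clean = rng.normal(size=(4, 3, 8, 12))
latent = build_conditioned_latent(noisy_ego, ego_prior, exo_clean)
print(latent.shape)
```

Placing the exocentric latent side by side with the egocentric one means a self-attention layer sees both views as a single token sequence, whereas the channel-wise prior is pixel-aligned with the frames being generated.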

Paper Structure

This paper contains 35 sections, 9 equations, 18 figures, 6 tables.

Figures (18)

  • Figure 1: Given a single exocentric video, EgoX generates what the scene would look like from the actor’s eyes. Shown with an in-the-wild clip from The Dark Knight, our approach achieves realistic and generalizable egocentric generation.
  • Figure 2: Exo-to-Ego view generation example. The model has to preserve view-related content from the exocentric input, generate uninformed regions realistically, and ignore unrelated areas for consistent egocentric synthesis.
  • Figure 3: Overall pipeline. Given an exocentric video input, we first lift it into a 3D point cloud and render the scene from the egocentric viewpoint to obtain the egocentric prior video. The clean exocentric video latent and the egocentric prior latent are combined via width-wise and channel-wise concatenation in the latent space, and then fed into a pretrained video diffusion model equipped with the proposed geometry-guided self-attention.
  • Figure 4: Geometry-Guided Self-Attention Overview. 3D direction similarities between egocentric queries and exocentric keys are used as an additive bias in the attention map, guiding the model to focus on geometrically aligned regions. Although the orange and red directions are the same key tokens, their directions differ due to different camera centers. The blue–red pairs have similar directions and thus receive higher scores, whereas the green–orange pairs have opposite directions and obtain lower scores.
  • Figure 5: Qualitative comparison. Each example shows the exocentric input views and the corresponding generated egocentric views. While other methods fail to reconstruct realistic and coherent videos, our approach produces geometrically accurate and high-quality egocentric generations. N/A indicates that the result is unavailable either due to missing ground truth or the need for additional input views.
  • ...and 13 more figures
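
The geometry-guided self-attention described in the Figure 4 caption can be sketched as follows. This is a simplified single-head NumPy illustration under stated assumptions, not the paper's implementation: directions are taken from the egocentric camera center, the similarity is cosine similarity, and the bias is added to the attention logits before the softmax. The names `geometry_guided_attention`, `query_dirs`, `key_points`, `ego_center`, and the scalar `bias_weight` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def unit(v):
    # Normalize vectors along the last axis.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def geometry_guided_attention(q, k, v, query_dirs, key_points, ego_center,
                              bias_weight=1.0):
    """Attention from egocentric queries to exocentric keys with a 3D bias.

    q: (Nq, d), k: (Nk, d), v: (Nk, dv)
    query_dirs: (Nq, 3) viewing rays of the egocentric query tokens
    key_points: (Nk, 3) 3D positions of the exocentric key tokens
    ego_center: (3,)    egocentric camera center (assumed reference point)
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)

    # Direction from the ego camera center to each key's 3D point.
    key_dirs = unit(key_points - ego_center)

    # Cosine similarity in [-1, 1]: aligned directions get a higher score.
    bias = unit(query_dirs) @ key_dirs.T

    weights = softmax(logits + bias_weight * bias, axis=-1)
    return weights @ v, weights


# One query looking along +z; one key in front of the camera, one behind it.
q = np.zeros((1, 4))
k = np.zeros((2, 4))
v = np.array([[1.0, 0.0], [0.0, 1.0]])
query_dirs = np.array([[0.0, 0.0, 1.0]])
key_points = np.array([[0.0, 0.0, 5.0], [0.0, 0.0, -5.0]])
ego_center = np.zeros(3)

out, w = geometry_guided_attention(q, k, v, query_dirs, key_points, ego_center)
_, w_plain = geometry_guided_attention(q, k, v, query_dirs, key_points,
                                       ego_center, bias_weight=0.0)
print(w, w_plain)
```

With identical content logits, the key lying along the query's viewing direction receives more attention than the one behind the camera; setting `bias_weight` to zero recovers plain attention.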