Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis

Mohammad Mahdi; Yuqian Fu; Nedko Savov; Jiancheng Pan; Danda Pani Paudel; Luc Van Gool

TL;DR

This work addresses exocentric-to-egocentric video generation by adapting a large-scale foundation video diffusion model (WAN 2.2) to cross-view synthesis. It introduces three key components, implemented in a two-stage fine-tuning pipeline: EgoExo-Align, which aligns ego-first-frame latent representations with exocentric views; MultiExoCon, which conditions on multiple exocentric videos; and PoseInj, which injects relative camera pose information via Plücker embeddings. Experiments on the Ego-Exo4D benchmark show consistent improvements over a strong baseline (VAWAN) in PSNR, SSIM, and LPIPS, complemented by a user study favoring Exo2EgoSyn. The work demonstrates that foundation models can be repurposed for cross-view video generation with scalable conditioning signals, enabling practical exocentric-to-egocentric synthesis without training from scratch.

Abstract

Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we introduce Exo2EgoSyn, an adaptation of WAN 2.2 that unlocks Exocentric-to-Egocentric (Exo2Ego) cross-view video synthesis. Our framework consists of three key modules. Ego-Exo View Alignment (EgoExo-Align) enforces latent-space alignment between exocentric and egocentric first-frame representations, reorienting the generative space from the given exo view toward the ego view. Multi-view Exocentric Video Conditioning (MultiExoCon) aggregates multi-view exocentric videos into a unified conditioning signal, extending WAN 2.2 beyond its vanilla single-image or text conditioning. Furthermore, Pose-Aware Latent Injection (PoseInj) injects relative exo-to-ego camera pose information into the latent state, guiding geometry-aware synthesis across viewpoints. Together, these modules enable high-fidelity ego-view video generation from third-person observations without retraining from scratch. Experiments on Ego-Exo4D validate that Exo2EgoSyn significantly improves Exo2Ego synthesis, paving the way for scalable cross-view video generation with foundation models. Source code and models will be released publicly.
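To make the pose signal concrete, the following is a minimal sketch of a per-pixel Plücker ray embedding, the representation the TL;DR names for PoseInj. It is not the authors' implementation. The function name `plucker_embedding`, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the convention that `R` and `t` map camera coordinates into a reference frame are all assumptions made for illustration; how the paper resizes or injects the resulting map into the latent state is not specified here.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def matvec(R, v):
    """Multiply a 3x3 matrix (nested sequences) by a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))

def plucker_embedding(fx, fy, cx, cy, R, t, height, width):
    """Return a height x width grid of 6-D Plücker coordinates (o x d, d).

    Assumed convention (hypothetical, not taken from the paper):
    R (3x3) and t (3,) are the relative pose mapping this camera's
    coordinates into the reference frame, so t is the ray origin o.
    """
    grid = []
    for v in range(height):
        row = []
        for u in range(width):
            # Back-project the pixel centre through a pinhole camera model.
            d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
            # Rotate the ray into the reference frame and normalise it.
            d = matvec(R, d_cam)
            norm = math.sqrt(sum(c * c for c in d))
            d = tuple(c / norm for c in d)
            # Moment vector o x d; together with d it identifies the ray.
            m = cross(t, d)
            row.append(m + d)
        grid.append(row)
    return grid
```

Each pixel receives six channels: the moment `o x d` followed by the unit direction `d`. Because the moment is always orthogonal to the direction, the embedding encodes the ray itself rather than a particular point on it, which is one reason this representation is commonly used to condition generative models on camera pose.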
