Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis

Mohammad Mahdi, Yuqian Fu, Nedko Savov, Jiancheng Pan, Danda Pani Paudel, Luc Van Gool

TL;DR

This work addresses exocentric-to-egocentric video generation by adapting a large-scale foundation video diffusion model (WAN2.2) to cross-view synthesis. It introduces three key components—EgoExo-Align to align ego-first-frame latent representations with exocentric views, MultiExoCon to condition on multiple exocentric videos, and PoseInj to inject relative camera pose information via Plücker embeddings—implemented in a two-stage fine-tuning pipeline. Experiments on the Ego-Exo4D benchmark show consistent improvements over a strong baseline (VAWAN) in PSNR, SSIM, and LPIPS, complemented by a user study favoring Exo2EgoSyn. The work demonstrates that foundation models can be repurposed for cross-view video generation with scalable conditioning signals, enabling practical exocentric-to-egocentric synthesis without training from scratch.

Abstract

Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we introduce Exo2EgoSyn, an adaptation of WAN 2.2 that unlocks Exocentric-to-Egocentric (Exo2Ego) cross-view video synthesis. Our framework consists of three key modules. Ego-Exo View Alignment (EgoExo-Align) enforces latent-space alignment between exocentric and egocentric first-frame representations, reorienting the generative space from the given exo view toward the ego view. Multi-view Exocentric Video Conditioning (MultiExoCon) aggregates multi-view exocentric videos into a unified conditioning signal, extending WAN 2.2 beyond its vanilla single-image or text conditioning. Furthermore, Pose-Aware Latent Injection (PoseInj) injects relative exo-to-ego camera pose information into the latent state, guiding geometry-aware synthesis across viewpoints. Together, these modules enable high-fidelity ego-view video generation from third-person observations without retraining from scratch. Experiments on Ego-Exo4D validate that Exo2EgoSyn significantly improves Exo2Ego synthesis, paving the way for scalable cross-view video generation with foundation models. Source code and models will be released publicly.
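
To make the pose signal used by PoseInj more concrete, the following is a minimal sketch of how a relative exo-to-ego camera pose could be turned into a per-pixel Plücker ray embedding. It is an illustration under stated assumptions, not the authors' implementation: the function names `relative_pose` and `plucker_embedding`, the pinhole intrinsics `K`, the camera-to-world pose convention, the (moment, direction) channel order, and the toy resolution are all hypothetical choices made here. How the embedding is resized and concatenated with the diffusion model's hidden states is not specified in this summary and is left out.

```python
import numpy as np


def relative_pose(c2w_exo: np.ndarray, c2w_ego: np.ndarray) -> np.ndarray:
    """Ego camera-to-world pose expressed in the exo camera's frame (4x4).

    Assumption: both inputs are 4x4 camera-to-world matrices.
    """
    return np.linalg.inv(c2w_exo) @ c2w_ego


def plucker_embedding(K: np.ndarray, c2w: np.ndarray, height: int, width: int) -> np.ndarray:
    """Per-pixel Plücker ray coordinates (o x d, d), shape (height, width, 6).

    Assumption: K is a 3x3 pinhole intrinsics matrix and c2w maps camera
    coordinates into the reference frame in which rays are expressed.
    """
    # Pixel centres in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)
    # Back-project to camera-space directions, then rotate into the reference frame.
    dirs_cam = pixels @ np.linalg.inv(K).T
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Every ray starts at the camera centre; the moment is origin x direction.
    origin = np.broadcast_to(c2w[:3, 3], dirs.shape)
    moment = np.cross(origin, dirs)
    return np.concatenate([moment, dirs], axis=-1)


# Toy example: the ego camera is shifted relative to an exo camera at the origin.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
c2w_exo = np.eye(4)
c2w_ego = np.eye(4)
c2w_ego[:3, 3] = [0.5, 0.0, 1.0]
emb = plucker_embedding(K, relative_pose(c2w_exo, c2w_ego), height=48, width=64)
print(emb.shape)  # (48, 64, 6)
```

Each pixel receives six channels: the ray's moment about the reference origin and its unit direction. A representation of this kind encodes the full ray geometry per spatial location, which is one reason it is a common choice for conditioning image-shaped latents on camera pose.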

Paper Structure

This paper contains 35 sections, 8 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Adapting a video generator for the Exo2Ego task with Exo2EgoSyn. Top: Multi-view (4 exo, 1 ego) setup. Middle: The original generator cannot perform cross-view translation. Bottom: Our model Exo2EgoSyn enables Exo2Ego cross-view translation.
  • Figure 2: Exo2EgoSyn Framework. We first propose the EgoExo-Align module, which predicts the latent representation of the first frame of the ego video. The MultiExoCon module replaces the original text conditioning in the foundation model with latent tokens derived from exocentric videos, injected into each DiT block of the diffusion model. During the second stage of fine-tuning, we concatenate the Plücker embeddings of the relative camera poses with the hidden states of the diffusion model before passing them to the first DiT block.
  • Figure 3: Qualitative Results. On the left, a scene with high camera motion is shown, while on the right, a more complex environment is selected to illustrate our model’s capabilities across different scenarios.
  • Figure 4: User Study Score Distribution. Votes show a clear preference for our model (score 1), with fewer votes for the baseline (score 0) or for no preference (score 0.5).
  • Figure 5: WAN2.2's Bias of Replicating the Conditioned Image. On the left is the conditioning image. On the right, WAN2.2 generates five distinct videos based on different text prompts. The first three (blue arrows) use general prompts, while the last two (green arrows) use prompts tailored for our task. The text prompts used are as follows: $\mathcal{A}$: "Do start the video from a black frame.", $\mathcal{B}$: "Begin the video at a moment when the man is already in motion farther down the bike, as if some time has passed since the moment shown in the image.", $\mathcal{C}$: "The video must not begin with a scene resembling the supplied image. The opening frame should show a noticeably changed situation — the man in a different location or pose.", $\mathcal{D}$: "Produce a fully ego-centric video starting from the provided image. The camera must be strictly from the man’s eyes, showing the environment as he sees it — hands, body, and movement should appear naturally from a first-person perspective.", and $\mathcal{E}$: "Generate the ego-centric video from the image.".
  • ...and 7 more figures