WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation

Quanjian Song, Yiren Song, Kelly Peng, Yuan Gao, Mike Zheng Shou

TL;DR

Problem: Translating videos between egocentric and exocentric perspectives is challenging because the two views must stay synchronized and preserve character identity, and existing methods rely on geometry or camera poses.

Approach: WorldWander uses in-context learning on triplets to learn cross-view mappings without camera poses. It combines In-Context Perspective Alignment with Collaborative Position Encoding, is fine-tuned efficiently with LoRA on a Wan2.2-5B diffusion backbone, and is supported by the EgoExo-8K dataset.

Contributions: A geometry-free translation framework, two novel cross-view modeling modules, the large-scale paired EgoExo-8K dataset, and a comprehensive evaluation showing superior cross-view synchronization, character consistency, and generalization.

Significance: Enables immersive, character-centric video synthesis for filmmaking, embodied AI, and world-model applications, with practical data-efficient training and a standardized benchmark for egocentric–exocentric translation.
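
The TL;DR says the backbone is fine-tuned with LoRA. As a rough illustration of why that is parameter-efficient, the sketch below applies the standard LoRA formulation to a single linear layer: the pretrained weight stays frozen and only a low-rank update B·A is trained. This is not the authors' implementation. The class name `LoRALinear`, the `rank` and `alpha` values, and the tiny layer size are all hypothetical, chosen only to keep the example runnable in plain Python.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Hypothetical illustration of generic LoRA, not WorldWander's code.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # pretrained weight, kept frozen
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A gets a small random init; B starts at zero, so the layer
        # initially behaves exactly like the pretrained one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        """x is a list of input row vectors (n x d_in); returns n x d_out."""
        # y = x W^T + scale * (x A^T) B^T
        base = matmul(x, transpose(self.weight))
        update = matmul(matmul(x, transpose(self.A)), transpose(self.B))
        return [[b + self.scale * u for b, u in zip(b_row, u_row)]
                for b_row, u_row in zip(base, update)]

    def trainable_params(self):
        """Number of parameters in A and B (the only ones that would be trained)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Usage: an 8x8 identity weight with a rank-2 adapter.
identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layer = LoRALinear(identity, rank=2, alpha=4.0)
inputs = [[float(i) for i in range(8)]]
outputs = layer.forward(inputs)
```

In this toy setting the adapter holds 32 trainable values against 64 in the full weight matrix. The gap widens quickly at realistic layer sizes, since the adapter grows linearly with the layer width while the full matrix grows quadratically.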

Abstract

Video diffusion models have recently achieved remarkable progress in realism and controllability. However, achieving seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To further support our task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric-exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.

Paper Structure

This paper contains 23 sections, 8 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Gallery of WorldWander. It bridges the egocentric and exocentric worlds in video generation, enabling immersive exploration.
  • Figure 2: Overall pipeline of WorldWander. The Wan2.2-5B backbone [wan2025wan] is fine-tuned with the proposed In-Context Perspective Alignment (Shared Latent Space, Different Noise Levels, and Collaborative Attention) as well as Collaborative Position Encoding.
  • Figure 3: Comparison of fine-tuning loss across different approaches on synthetic scenarios. Collaborative Attention demonstrates faster convergence than Channel-Wise Concatenation.
  • Figure 4: Showcase of our curated EgoExo-8K. It features diverse synthetic and real-world indoor and outdoor scenarios.
  • Figure 5: User study of different methods on both the exocentric-to-egocentric and egocentric-to-exocentric translation tasks. We report average results of the synthetic and real-world scenarios.
  • ...and 7 more figures