WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation
Quanjian Song, Yiren Song, Kelly Peng, Yuan Gao, Mike Zheng Shou
TL;DR
Problem: translating videos between egocentric and exocentric perspectives is challenging because the two views must stay synchronized and the character's identity must remain consistent; existing methods rely on geometry or camera poses.
Approach: WorldWander uses in-context learning on triplets to learn cross-view mappings without camera poses, via In-Context Perspective Alignment and Collaborative Position Encoding. It is fine-tuned efficiently with LoRA on a Wan2.2-5B diffusion backbone and supported by the EgoExo-8K dataset.
Contributions: a geometry-free translation framework; two cross-view modeling modules; EgoExo-8K, a large-scale dataset of synchronized egocentric-exocentric triplets; and a comprehensive evaluation showing superior cross-view synchronization, character consistency, and generalization.
Significance: enables immersive, character-centric video synthesis for filmmaking, embodied AI, and world-model applications, with data-efficient training and a standardized benchmark for egocentric-exocentric translation.
Abstract
Video diffusion models have recently achieved remarkable progress in realism and controllability. However, achieving seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To further support our task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric-exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.
