Martian World Model: Controllable Video Synthesis with Physically Accurate 3D Reconstructions
Longfei Li, Zhiwen Fan, Wenyan Cong, Xinhang Liu, Yuyang Yin, Matt Foutter, Panwang Pan, Chenyu You, Yue Wang, Zhangyang Wang, Yao Zhao, Marco Pavone, Yunchao Wei
TL;DR
The paper tackles the scarcity of Martian data and the domain gap between Martian and terrestrial imagery by introducing M3arsSynth, a data engine that reconstructs metric-scale 3D Martian environments from NASA stereo imagery, and MarsGen, a controllable video generator trained on that data. Using a metric-aware initialization and 3D Gaussian Splatting, the approach yields photorealistic, 3D-consistent Martian videos conditioned on an initial frame and, optionally, camera trajectories or text prompts. Experiments show that MarsGen outperforms Earth-trained baselines in visual fidelity and geometric coherence, supporting realistic mission simulation for navigation, planning, and robotic training. The work provides a scalable, multimodal Martian dataset and a 3D-aware synthesis framework, with broader implications for planetary robotics and simulation, alongside considerations of potential misuse and resource demands.
Abstract
Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) a data curation pipeline, Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images sourced from NASA's Planetary Data System (PDS) and renders high-fidelity multiview 3D video sequences; and 2) a Martian terrain video generator, MarsGen, which synthesizes novel videos that are visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.
