Martian World Model: Controllable Video Synthesis with Physically Accurate 3D Reconstructions

Longfei Li, Zhiwen Fan, Wenyan Cong, Xinhang Liu, Yuyang Yin, Matt Foutter, Panwang Pan, Chenyu You, Yue Wang, Zhangyang Wang, Yao Zhao, Marco Pavone, Yunchao Wei

TL;DR

The paper tackles the scarcity of Martian data and domain mismatch by introducing M3arsSynth, a data engine that reconstructs metric-scale 3D Martian environments from NASA stereo imagery, and MarsGen, a controllable video generator trained on that data. Using a metric-aware initialization and 3D Gaussian Splatting, the approach yields photorealistic, 3D-consistent Martian videos conditioned on initial frames, trajectories, or text prompts. Experiments show MarsGen outperforms Earth-trained baselines in visual fidelity and geometric coherence, enabling realistic mission simulations for navigation, planning, and robotic training. The work provides a scalable, multimodal Martian dataset and a 3D-aware synthesis framework, with broader implications for planetary robotics and simulation, alongside considerations for misuse and resource demands.

Abstract

Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) a data curation pipeline, Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images sourced from NASA's Planetary Data System (PDS) and renders high-fidelity multiview 3D video sequences; and 2) a Martian terrain video generator, MarsGen, which synthesizes novel videos that are visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.

Paper Structure

This paper contains 20 sections, 12 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of the M3arsSynth data engine and MarsGen video generator. The M3arsSynth engine processes NASA stereo navigation imagery into a versatile multimodal Mars dataset comprising video, depth/normal maps (from 3D reconstructions), and text descriptions. These outputs advance Mars scene generation and simulation for mission rehearsal and robotic navigation.
  • Figure 2: Data Filtering Examples for Martian 3D Reconstruction. Image (a) represents a clear, usable Martian terrain view, serving as a quality benchmark. In contrast, images (b)-(h) illustrate common defects that lead to data exclusion for high-quality reconstruction, including: (b) extensive missing data blocks or pixelation/mosaic artifacts (indicative of data corruption or severe compression); (c) significant image blur or out-of-focus areas; (d) scenes with extreme overexposure or harsh lighting conditions; and (e)-(h) views obstructed by spacecraft components.
  • Figure 3: Overview of the M3arsSynth dataset construction and conditional video generation through MarsGen. The red box outlines the data curation pipeline, the green box shows the obtained M3arsSynth dataset, and the blue box details our MarsGen model. We process stereo image pairs using a metric-aware foundation model and solve the Perspective-n-Point (PnP) problem [lepetit2009epnp] to reconstruct metric-scale 3D Martian scenes. Video frames rendered from these scenes, together with text prompts and encoded camera trajectories, are then used to condition a Video Diffusion Transformer, enabling the synthesis of novel and controllable Martian video sequences.
  • Figure 4: Distribution of primary terrain types within the M3arsSynth dataset, showcasing the diversity of Martian environments covered. The left chart indicates the percentage of scenes predominantly featuring each terrain type, with visual examples illustrating the various terrain categories.
  • Figure 5: Qualitative comparison of point cloud reconstruction from a Martian input view. Our M3arsSynth engine (top right) produces a coherent point cloud accurately capturing terrain. In contrast, the VGGT model [wang2025vggt] (bottom right) exhibits significant misalignment and artifacts.
  • ...and 2 more figures
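
The Figure 3 caption mentions solving a Perspective-n-Point (PnP) problem to recover camera poses for metric-scale reconstruction. As a rough illustration of what a PnP solve involves, the sketch below estimates a camera pose from 3D-2D correspondences with a plain Direct Linear Transform (DLT). This is not the paper's implementation: the caption cites EPnP, whereas DLT is a simpler stand-in, and the function name `solve_pnp_dlt`, the intrinsics `K`, and the synthetic points are all assumptions made for the example.

```python
import numpy as np

def solve_pnp_dlt(points_3d, pixels, K):
    """Estimate (R, t) such that x_cam = R @ X + t, using a linear DLT.

    Assumes at least 6 non-coplanar 3D points with matching pixel
    coordinates and a known 3x3 intrinsic matrix K (hypothetical inputs).
    """
    points_3d = np.asarray(points_3d, dtype=float)
    pixels = np.asarray(pixels, dtype=float)
    n = len(points_3d)

    # Convert pixels to normalized image coordinates via K^-1 [u, v, 1]^T.
    pix_h = np.hstack([pixels, np.ones((n, 1))])
    norm = (np.linalg.inv(K) @ pix_h.T).T
    norm = norm[:, :2] / norm[:, 2:3]

    # Each correspondence gives two linear equations in the 12 entries of
    # the 3x4 pose matrix P = [R | t] (stored row-major).
    X_h = np.hstack([points_3d, np.ones((n, 1))])
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = X_h
    A[0::2, 8:12] = -norm[:, 0:1] * X_h
    A[1::2, 4:8] = X_h
    A[1::2, 8:12] = -norm[:, 1:2] * X_h

    # The solution (up to scale and sign) is the right singular vector
    # associated with the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)

    # Resolve the sign so the rotation part has positive determinant.
    if np.linalg.det(P[:, :3]) < 0:
        P = -P

    # Project the 3x3 block onto the nearest rotation and undo the scale.
    U, S, Vt_r = np.linalg.svd(P[:, :3])
    R = U @ Vt_r
    t = P[:, 3] / S.mean()
    return R, t

# Synthetic check with a made-up camera and scene (not Mars data).
rng = np.random.default_rng(0)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = 0.2  # rotation about the y axis, in radians
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([0.3, -0.1, 5.0])

points = rng.uniform(-1.0, 1.0, size=(12, 3))
cam = points @ R_true.T + t_true      # points in the camera frame
proj = cam @ K.T                      # homogeneous pixel coordinates
pixels = proj[:, :2] / proj[:, 2:3]

R_est, t_est = solve_pnp_dlt(points, pixels, K)
print("rotation error:", np.abs(R_est - R_true).max())
print("translation error:", np.abs(t_est - t_true).max())
```

Because the 3D points carry metric units, the recovered translation is metric as well, which is the property a metric-scale reconstruction relies on. With noisy real correspondences, a robust solver (for example EPnP inside a RANSAC loop, followed by nonlinear refinement) would normally be preferred over this linear sketch.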