SatDreamer360: Multiview-Consistent Generation of Ground-Level Scenes from Satellite Imagery
Xianghui Ze, Beiyi Zhu, Zhenbo Song, Jianfeng Lu, Yujiao Shi
TL;DR
SatDreamer360 addresses the generation of continuous, multiview ground-level panoramas from a single satellite image along a predefined trajectory. It combines a triplane scene representation with ray-guided cross-view feature conditioning and an epipolar-constrained attention mechanism to enforce geometry-aware, temporally coherent outputs, and is validated on the new VIGOR++ dataset. Key contributions include the unified diffusion-based framework, the ray-based attention for view-specific feature retrieval, the panoramic inter-frame alignment strategy, and the large-scale VIGOR++ benchmark. The approach advances practical cross-view synthesis for simulation, autonomous navigation, and digital twin applications by improving satellite-to-ground alignment and multiview consistency across diverse urban and rural scenes.
Abstract
Generating multiview-consistent $360^\circ$ ground-level scenes from satellite imagery is a challenging task with broad applications in simulation, autonomous navigation, and digital twin cities. Existing approaches primarily focus on synthesizing individual ground-view panoramas, often relying on auxiliary inputs such as height maps or handcrafted projections, and struggle to produce multiview-consistent sequences. In this paper, we propose SatDreamer360, a framework that generates geometrically consistent multiview ground-level panoramas from a single satellite image, given a predefined pose trajectory. To address the large viewpoint discrepancy between ground and satellite images, we adopt a triplane representation to encode scene features and design a ray-based pixel attention mechanism that retrieves view-specific features from the triplane. To maintain multi-frame consistency, we introduce a panoramic epipolar-constrained attention module that aligns features across frames based on known relative poses. To support evaluation, we introduce VIGOR++, a large-scale dataset for generating multiview ground panoramas from a satellite image, built by augmenting the original VIGOR dataset with more ground-view images and their pose annotations. Experiments show that SatDreamer360 outperforms existing methods in both satellite-to-ground alignment and multiview consistency.
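To make the geometry behind ray-based feature retrieval concrete, the following is a minimal sketch, not the paper's implementation. It shows how a pixel of an equirectangular panorama could be turned into a viewing ray, how 3D points could be sampled along that ray, and how each point projects onto the three axis-aligned planes of a triplane. The function names, the coordinate convention (z up, longitude increasing left to right), and the uniform depth sampling are all assumptions made for illustration; the attention over the retrieved features is omitted.

```python
import math


def pano_pixel_to_ray(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit ray direction (x, y, z).

    Assumed convention (not taken from the paper): longitude runs from -pi to pi
    left to right, latitude runs from +pi/2 to -pi/2 top to bottom, z points up,
    and pixel centers sit at a +0.5 offset.
    """
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.cos(lat) * math.cos(lon)
    z = math.sin(lat)
    return (x, y, z)


def sample_points_along_ray(origin, direction, near, far, num_samples):
    """Return num_samples 3D points evenly spaced in depth between near and far."""
    if num_samples < 2:
        depths = [near]
    else:
        step = (far - near) / (num_samples - 1)
        depths = [near + i * step for i in range(num_samples)]
    return [
        tuple(o + t * d for o, d in zip(origin, direction))
        for t in depths
    ]


def triplane_coords(point):
    """Project a 3D point onto the XY, XZ, and YZ planes of a triplane."""
    x, y, z = point
    return {"xy": (x, y), "xz": (x, z), "yz": (y, z)}


# Hypothetical usage: camera 2 units above the ground, looking along the pixel ray.
ray = pano_pixel_to_ray(3.5, 1.5, width=8, height=4)
points = sample_points_along_ray((0.0, 0.0, 2.0), ray, near=1.0, far=3.0, num_samples=3)
plane_queries = [triplane_coords(p) for p in points]
```

In a full model, the 2D coordinates in `plane_queries` would index learned feature maps on each plane, and the per-pixel features gathered along the ray would then be aggregated, for example by attention.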
