Satellite to GroundScape -- Large-scale Consistent Ground View Generation from Satellite Views
Ningli Xu, Rongjun Qin
TL;DR
Sat2GroundScape addresses cross-view synthesis of ground-level images from satellite imagery, where large gaps in viewpoint and resolution make it hard to keep neighboring ground views consistent. It builds on a fixed latent diffusion model with two conditioning modules: satellite-guided denoising, which preserves scene layout, and satellite-temporal denoising, which keeps multiple generated ground views consistent with one another. The work also contributes a large-scale Sat2GroundScape dataset containing over 25k panoramic and 100k perspective satellite-ground pairs. The method outperforms state-of-the-art baselines on perceptual and temporal metrics, producing photorealistic and coherent multi-view ground scenes from satellite inputs. Potential applications include scalable ground-scene generation for urban modeling, gaming, and simulation.
Abstract
Generating consistent ground-view images from satellite imagery is challenging, primarily due to the large discrepancies in viewing angles and resolution between satellite and ground-level domains. Previous efforts mainly concentrated on single-view generation, often resulting in inconsistencies across neighboring ground views. In this work, we propose a novel cross-view synthesis approach designed to overcome these challenges by ensuring consistency across ground-view images generated from satellite views. Our method, based on a fixed latent diffusion model, introduces two conditioning modules: satellite-guided denoising, which extracts high-level scene layout to guide the denoising process, and satellite-temporal denoising, which captures camera motion to maintain consistency across multiple generated views. We further contribute a large-scale satellite-ground dataset containing over 100,000 perspective pairs to facilitate extensive ground scene or video generation. Experimental results demonstrate that our approach outperforms existing methods on perceptual and temporal metrics, achieving high photorealism and consistency in multi-view outputs.
