Satellite to GroundScape -- Large-scale Consistent Ground View Generation from Satellite Views

Ningli Xu, Rongjun Qin

TL;DR

Sat2GroundScape tackles cross-view ground-view synthesis from satellite imagery by addressing large viewpoint and resolution gaps that cause multi-view inconsistencies. It introduces a fixed latent diffusion model with satellite-guided denoising to preserve scene layout and satellite-temporal denoising to ensure temporal consistency across multiple ground views, along with a large-scale Sat2GroundScape dataset containing over 25k panoramic and 100k perspective satellite-ground pairs. The method achieves superior perceptual and temporal metrics compared with state-of-the-art baselines, delivering photorealistic and coherent multi-view ground scenes from satellite inputs. This approach enables scalable ground-scene generation for urban modeling, gaming, and simulation, leveraging large satellite datasets to produce realistic, consistent ground views.

Abstract

Generating consistent ground-view images from satellite imagery is challenging, primarily due to the large discrepancies in viewing angles and resolution between satellite and ground-level domains. Previous efforts mainly concentrated on single-view generation, often resulting in inconsistencies across neighboring ground views. In this work, we propose a novel cross-view synthesis approach designed to overcome these challenges by ensuring consistency across ground-view images generated from satellite views. Our method, based on a fixed latent diffusion model, introduces two conditioning modules: satellite-guided denoising, which extracts high-level scene layout to guide the denoising process, and satellite-temporal denoising, which captures camera motion to maintain consistency across multiple generated views. We further contribute a large-scale satellite-ground dataset containing over 100,000 perspective pairs to facilitate extensive ground scene or video generation. Experimental results demonstrate that our approach outperforms existing methods on perceptual and temporal metrics, achieving high photorealism and consistency in multi-view outputs.

Paper Structure

This paper contains 17 sections, 5 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Ground views generated by Sat2GroundScape. Using satellite views as input, Sat2GroundScape generates a sequence of ground views that exhibit photorealistic quality and maintain consistent ground appearances across different perspectives.
  • Figure 2: Overview of the Sat2GroundScape pipeline. The satellite appearance is first projected onto the ground level using the estimated satellite geometry. Satellite-Guided Denoising then guides the latent diffusion model (LDM) to generate individual ground views that preserve the original scene layout. Satellite-Temporal Denoising further enforces consistency across the generated views. Inputs and outputs are marked in red.
  • Figure 3: Satellite-Guided Denoising. Conditioned on a given satellite view, a random noisy latent feature $\boldsymbol{z}_T$ is iteratively denoised into the corresponding ground-view latent feature $\boldsymbol{z}_0$, rather than into an arbitrary ground view. High-level satellite features are extracted and used to guide the standard LDM during denoising. Note that the $\boldsymbol{z}_i$ live in latent space; they are illustrated here with their corresponding pixel-space images.
  • Figure 4: Satellite-Temporal Denoising takes a sequence of ground-view satellite appearances $\{\boldsymbol{I}_g^i\}$ as input and generates consistent ground views $\{\boldsymbol{x}^i\}$. It first generates the initial ground view $\boldsymbol{x}^{init}$ and concatenates it with the initial noise to form the input to the spatial-temporal LDM. In addition, $\{\boldsymbol{I}_g^i\}$ are encoded as camera motion features that guide the denoising process. Variables in red are the inputs and outputs of the method.
  • Figure 5: Sat2GroundScape dataset. The dataset provides accurately aligned satellite and ground data, including appearance, depth, and camera pose information, in both panoramic (over 25,000 pairs) and perspective (over 100,000 pairs) formats. Each ground panorama is associated with four perspective views, labeled "LF, LR, RF, RR" (left forward, left rear, right forward, and right rear). The dataset also includes a dense ground collection (marked as red dots) with intervals of 3 to 10 meters between points, supporting large-scale scene and video generation tasks.
  • ...and 3 more figures
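
The Figure 5 caption says each ground panorama is associated with four perspective views (LF, LR, RF, RR). The following is a minimal sketch of how such views could be sampled from an equirectangular panorama. It is not the authors' pipeline: the yaw angles, the 90-degree field of view, the output size, and the names `VIEW_YAWS`, `pano_to_perspective`, and `split_panorama` are all illustrative assumptions that do not appear in the source.

```python
import numpy as np

# Hypothetical headings in degrees (0 = panorama centre, positive = right).
# The source names the four views but gives neither headings nor field of view.
VIEW_YAWS = {"LF": -45.0, "LR": -135.0, "RF": 45.0, "RR": 135.0}


def pano_to_perspective(pano, yaw_deg, fov_deg=90.0, size=256):
    """Sample a square pinhole view from an equirectangular panorama.

    Uses nearest-neighbour lookup; `pano` has shape (H, W) or (H, W, C).
    """
    h, w = pano.shape[:2]
    # Focal length in pixels for the requested horizontal field of view.
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2.0)
    # Pixel offsets from the principal point (x right, y down).
    offsets = np.arange(size) - (size - 1) / 2.0
    x, y = np.meshgrid(offsets, offsets)
    z = np.full_like(x, f)  # forward axis
    # Rotate the viewing rays about the vertical axis by the yaw angle.
    yaw = np.radians(yaw_deg)
    xr = x * np.cos(yaw) + z * np.sin(yaw)
    zr = -x * np.sin(yaw) + z * np.cos(yaw)
    # Convert rays to longitude/latitude, then to panorama pixel indices.
    lon = np.arctan2(xr, zr)               # in [-pi, pi]
    lat = np.arctan2(y, np.hypot(xr, zr))  # in [-pi/2, pi/2], positive = down
    col = ((lon / (2.0 * np.pi) + 0.5) * w).astype(int) % w
    row = np.clip(((lat / np.pi + 0.5) * h).astype(int), 0, h - 1)
    return pano[row, col]


def split_panorama(pano, **kwargs):
    """Return a dict of the four assumed perspective views for one panorama."""
    return {name: pano_to_perspective(pano, yaw, **kwargs)
            for name, yaw in VIEW_YAWS.items()}


if __name__ == "__main__":
    # Synthetic panorama whose pixel value equals its column index.
    demo = np.tile(np.arange(360), (180, 1))
    for name, view in split_panorama(demo).items():
        print(name, view.shape)
```

Nearest-neighbour sampling keeps the sketch short; a real preprocessing step would more likely use bilinear interpolation and the camera poses shipped with the dataset.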