CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis

Weijia Li, Jun He, Junyan Ye, Huaping Zhong, Zhimeng Zheng, Zilong Huang, Dahua Lin, Conghui He

TL;DR

CrossViewDiff tackles satellite-to-street view synthesis by introducing structure and texture controls derived from satellite data and integrating them through a cross-view attention-guided diffusion process. The method constructs 3D voxel-based scene structure and a cross-view texture mapping to provide local texture guidance, then fuses this information via an enhanced cross-view attention within a latent diffusion framework. A GPT-4o-based evaluation protocol assesses consistency, realism, and perceptual quality, showing strong alignment with human judgments. Experiments across CVUSA, CVACT, and OmniCity demonstrate superior performance over state-of-the-art methods, with ablations and multimodal data analyses highlighting the value of combining satellite-derived structure/texture cues with diffusion-based generation.

Abstract

Satellite-to-street view synthesis aims to generate a realistic street-view image from its corresponding satellite-view image. Although stable diffusion models have exhibited remarkable performance in a variety of image generation applications, their reliance on similar-view inputs to control the generated structure or texture restricts their application to the challenging cross-view synthesis task. In this work, we propose CrossViewDiff, a cross-view diffusion model for satellite-to-street view synthesis. To address the challenges posed by the large discrepancy across views, we design satellite scene structure estimation and cross-view texture mapping modules to construct the structural and textural controls for street-view image synthesis. We further design a cross-view control guided denoising process that incorporates these controls via an enhanced cross-view attention module. To evaluate the synthesis results more comprehensively, we additionally design a GPT-based scoring method as a supplement to standard evaluation metrics. We also explore the effect of different data sources (e.g., text, maps, building heights, and multi-temporal satellite imagery) on this task. Results on three public cross-view datasets show that CrossViewDiff outperforms the current state of the art on both standard and GPT-based evaluation metrics, generating high-quality street-view panoramas with more realistic structures and textures across rural, suburban, and urban scenes. The code and models of this work will be released at https://opendatalab.github.io/CrossViewDiff/.
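
The abstract states that the structural and textural controls enter the denoising process through a cross-view attention module. As a rough illustration of the general idea only, the sketch below implements plain scaled dot-product cross-attention in NumPy, with street-view tokens as queries and satellite-derived control tokens as keys and values. It is not the paper's enhanced module: the token counts, the feature size, and the names `street_tokens` and `control_tokens` are assumptions made for this example.

```python
import numpy as np


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (not the paper's enhanced variant).

    queries: (n_q, d) tokens that receive information
    keys:    (n_k, d) tokens that are attended to
    values:  (n_k, d_v) content carried by the attended tokens
    Returns the fused tokens (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Subtract the row-wise maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


rng = np.random.default_rng(0)
# Hypothetical shapes: 6 street-view latent tokens, 10 control tokens, 8 features each.
street_tokens = rng.normal(size=(6, 8))
control_tokens = rng.normal(size=(10, 8))

fused, attn = cross_attention(street_tokens, control_tokens, control_tokens)
```

Each row of `attn` is a probability distribution over the control tokens, so every street-view token receives a weighted mixture of the control features.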

Paper Structure (22 sections, 8 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Illustration of the satellite-to-street view synthesis task. (a) In cross-view scenarios, the satellite view and street view differ significantly, with limited overlapping information, posing a serious challenge to the satellite-to-street view synthesis task. (b) Compared with existing methods using GANs (e.g., Sat2Density) or diffusion models (e.g., ControlNet), CrossViewDiff is capable of synthesizing more realistic street-view images with better perceptual quality and consistency with the ground truth.
  • Figure 2: Overview of our proposed CrossViewDiff. First, we create 3D voxels based on a depth estimation method as intermediaries of information across different viewpoints. Subsequently, based on the satellite images and 3D voxels, we establish structural and textural controls for street view synthesis via satellite scene structure estimation and cross-view texture mapping, respectively. Finally, we integrate the above cross-view control information via an enhanced cross-view attention mechanism, guiding the denoising process to synthesize street-view images.
  • Figure 3: The overall process for automated evaluation using GPT-4o. Instructions are meta-prompts that include a task description, scoring criteria, scoring range, and scoring examples. We then use one GPT-4o instance as Evaluator A to provide initial scores and reasons based on the input prompts and image samples. Finally, the scores are combined with the image samples for a secondary evaluation by another GPT-4o instance, Inspector B, which assesses whether each score is appropriate and determines the final score.
  • Figure 3: Average similarity between human user ratings and GPT ratings.
  • Figure 4: Qualitative comparison of synthesis results on CVUSA, CVACT, and OmniCity. The comparison includes the synthesis results of Sat2Density, ControlNet, Instr-p2p, and our method. The results indicate that our method generates street views that are more realistic, more consistent, and of higher quality than those of the other methods.
  • ...and 7 more figures
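
The two-stage scoring described in the first Figure 3 caption (Evaluator A proposes a score and a reason, Inspector B reviews it and fixes the final score) can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the authors' implementation: the model calls are replaced by stub functions, and the `Judgement` type, the function names, and the 0 to 10 scoring range are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Judgement:
    score: float
    reason: str


# Hypothetical call signatures standing in for the two GPT-4o roles.
Evaluator = Callable[[str, str], Judgement]
Inspector = Callable[[str, str, Judgement], Judgement]


def two_stage_score(instructions: str, sample: str,
                    evaluator: Evaluator, inspector: Inspector) -> Judgement:
    """Run Evaluator A, then let Inspector B review and finalize the score."""
    initial = evaluator(instructions, sample)
    return inspector(instructions, sample, initial)


def stub_evaluator(instructions: str, sample: str) -> Judgement:
    # Stub: a real evaluator would query a model with the meta-prompt and images.
    return Judgement(score=12.0, reason="initial rating")


def stub_inspector(instructions: str, sample: str, initial: Judgement) -> Judgement:
    # Stub: enforce an assumed 0-10 scoring range; otherwise accept the initial score.
    clipped = min(max(initial.score, 0.0), 10.0)
    if clipped != initial.score:
        return Judgement(score=clipped, reason="adjusted to scoring range")
    return initial


result = two_stage_score("meta-prompt", "image-pair-id", stub_evaluator, stub_inspector)
```

Passing the two roles in as callables keeps the pipeline independent of any particular model client, so the stubs can be swapped for real calls without changing `two_stage_score`.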