CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis
Weijia Li, Jun He, Junyan Ye, Huaping Zhong, Zhimeng Zheng, Zilong Huang, Dahua Lin, Conghui He
TL;DR
CrossViewDiff tackles satellite-to-street view synthesis by deriving structure and texture controls from satellite data and integrating them through a cross-view attention-guided diffusion process. The method constructs a 3D voxel-based scene structure and a cross-view texture mapping to provide local texture guidance, then fuses this information via an enhanced cross-view attention module within a latent diffusion framework. A GPT-4o-based evaluation protocol assesses consistency, realism, and perceptual quality, and shows strong alignment with human judgments. Experiments on CVUSA, CVACT, and OmniCity demonstrate superior performance over state-of-the-art methods, with ablations and multimodal data analyses highlighting the value of combining satellite-derived structure and texture cues with diffusion-based generation.
Abstract
Satellite-to-street view synthesis aims at generating a realistic street-view image from its corresponding satellite-view image. Although stable diffusion models have exhibited remarkable performance in a variety of image generation applications, their reliance on similar-view inputs to control the generated structure or texture restricts their application to the challenging cross-view synthesis task. In this work, we propose CrossViewDiff, a cross-view diffusion model for satellite-to-street view synthesis. To address the challenges posed by the large discrepancy across views, we design satellite scene structure estimation and cross-view texture mapping modules to construct the structural and textural controls for street-view image synthesis. We further design a cross-view control guided denoising process that incorporates these controls via an enhanced cross-view attention module. To achieve a more comprehensive evaluation of the synthesis results, we additionally design a GPT-based scoring method as a supplement to standard evaluation metrics. We also explore the effect of different data sources (e.g., text, maps, building heights, and multi-temporal satellite imagery) on this task. Results on three public cross-view datasets show that CrossViewDiff outperforms the current state of the art on both standard and GPT-based evaluation metrics, generating high-quality street-view panoramas with more realistic structures and textures across rural, suburban, and urban scenes. The code and models of this work will be released at https://opendatalab.github.io/CrossViewDiff/.
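As a rough illustration of how attention can inject satellite-derived controls into street-view features, the following is a minimal sketch of generic scaled dot-product cross-attention. It is not the enhanced cross-view attention module of CrossViewDiff, whose details are not given here. The function names (`softmax`, `cross_attention`) and the toy feature vectors are assumptions made purely for illustration: queries stand in for street-view features, while keys and values stand in for satellite-derived control features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: hypothetical street-view feature vectors, each of length d
    keys:    hypothetical satellite-derived feature vectors, each of length d
    values:  hypothetical satellite-derived content vectors, one per key
    Returns (outputs, weights): one output vector and one weight row per query.
    """
    d = len(keys[0])
    value_dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        row = softmax(scores)
        # Weighted sum of the value vectors under the attention weights.
        out = [
            sum(w * v[c] for w, v in zip(row, values)) for c in range(value_dim)
        ]
        outputs.append(out)
        weights.append(row)
    return outputs, weights


# Toy data (assumed): two street-view queries attend over three satellite tokens.
street_queries = [[1.0, 0.0], [0.0, 1.0]]
satellite_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
satellite_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

outputs, weights = cross_attention(street_queries, satellite_keys, satellite_values)
```

In this toy setup each query places most of its weight on the satellite token whose key points in the same direction, so the output is dominated by that token's value vector. A real cross-view model would operate on learned projections of high-dimensional feature maps rather than hand-written vectors.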
