Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis

Junyan Ye, Jun He, Weijia Li, Zhutao Lv, Yi Lin, Jinhua Yu, Haote Yang, Conghui He

TL;DR

SkyDiffusion tackles cross-view ground-to-aerial image synthesis by bridging viewpoint gaps with a Curved-BEV transformation and steering a BEV-guided diffusion model to generate content-consistent aerial imagery. It introduces Ground2Aerial-3 (G2A-3), a diverse dataset for disaster, low-altitude UAV, and historical imagery synthesis, and demonstrates state-of-the-art results on CVUSA, CVACT, VIGOR-Chicago, and G2A-3 across multiple tasks. The approach yields significant improvements in realism (FID) and content fidelity (SSIM) while effectively mitigating occlusions in dense urban scenes, highlighting practical utility for disaster response and historical imagery analysis.

Abstract

Ground-to-aerial image synthesis focuses on generating realistic aerial images from corresponding ground street-view images while maintaining a consistent content layout, simulating a top-down view. The significant viewpoint difference leads to domain gaps between views, and dense urban scenes limit the visible range of street views, making this cross-view generation task particularly challenging. In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from street-view images, utilizing a diffusion model and the Bird's-Eye View (BEV) paradigm. The Curved-BEV method in SkyDiffusion converts street-view images into a BEV perspective, effectively bridging the domain gap, and employs a "multi-to-one" mapping strategy to address occlusion issues in dense urban scenes. Next, SkyDiffusion uses a BEV-guided diffusion model to generate content-consistent and realistic aerial images. Additionally, we introduce a novel dataset, Ground2Aerial-3, designed for diverse ground-to-aerial image synthesis applications, including disaster scene aerial synthesis, low-altitude UAV image synthesis, and historical high-resolution satellite image synthesis tasks. Experimental results demonstrate that SkyDiffusion outperforms state-of-the-art methods on cross-view datasets across natural (CVUSA), suburban (CVACT), urban (VIGOR-Chicago), and various application scenarios (G2A-3), achieving realistic and content-consistent aerial image generation. The code, datasets, and more information about this work can be found at https://opendatalab.github.io/skydiffusion/.

Paper Structure

This paper contains 17 sections, 6 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Illustration of the cross-view image synthesis task. (a) Challenges of cross-view domain gaps; (b) Challenges of occlusion in dense scenes; (c) Comparing our ground-to-aerial image synthesis method with existing cross-view synthesis methods.
  • Figure 2: Overview of the proposed SkyDiffusion framework. It includes the curved BEV transformation and the BEV-controlled diffusion model. The lower part presents the results of the One-to-One and Multi-to-One BEV transformations, respectively.
  • Figure 3: Schematic of the curved BEV transformation. It illustrates the mapping of two points on the BEV plane to the top and bottom of the street-view image during transformation.
  • Figure 4: Illustration of the Ground2Aerial-3 dataset.
  • Figure 5: Qualitative comparison of synthesis results from different methods on three datasets.
  • ...and 2 more figures
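
The Figure 3 caption describes mapping points on the BEV plane to positions in the street-view image. This summary does not give the Curved-BEV formula itself, so the following is only a rough sketch of that kind of mapping, not the authors' method. It assumes an equirectangular panorama with north at the center column, a camera at the center of the BEV grid, and a hypothetical height profile `z = -cam_height + curve_k * r**curve_p` for the BEV surface. The function names and all parameters (`cam_height`, `max_range`, `curve_k`, `curve_p`) are illustrative assumptions. With `curve_k = 0` the surface is a flat ground plane, so every BEV pixel samples from below the horizon; a positive `curve_k` raises distant points so they can sample from the upper part of the panorama.

```python
import numpy as np


def curved_bev_grid(bev_size, pano_h, pano_w, cam_height=2.0,
                    max_range=50.0, curve_k=0.0, curve_p=2.0):
    """Return integer (rows, cols) panorama coordinates for every BEV pixel.

    All parameters are hypothetical; the height profile below is an
    assumption made for illustration, not the formula from the paper.
    """
    half = (bev_size - 1) / 2.0
    idx = np.arange(bev_size)
    x = (idx - half) / half * max_range   # metres east of the camera
    y = (half - idx) / half * max_range   # metres north (row 0 is north)
    xx, yy = np.meshgrid(x, y)            # xx[i, j] = x[j], yy[i, j] = y[i]
    r = np.hypot(xx, yy)                  # horizontal distance to camera

    # Assumed surface height relative to the camera: flat ground at
    # -cam_height, lifted with distance when curve_k > 0.
    z = -cam_height + curve_k * r ** curve_p

    azimuth = np.arctan2(xx, yy)          # 0 = north, positive toward east
    elevation = np.arctan2(z, r)          # negative = below the horizon

    # Equirectangular projection: north at the centre column, horizon at
    # the middle row, straight down at the bottom row.
    cols = (azimuth / (2.0 * np.pi) + 0.5) * pano_w
    rows = (0.5 - elevation / np.pi) * pano_h

    cols = np.mod(np.floor(cols).astype(int), pano_w)
    rows = np.clip(np.floor(rows).astype(int), 0, pano_h - 1)
    return rows, cols


def panorama_to_bev(pano, bev_size, **kwargs):
    """Nearest-neighbour resampling of a panorama onto the BEV grid."""
    rows, cols = curved_bev_grid(bev_size, pano.shape[0], pano.shape[1],
                                 **kwargs)
    return pano[rows, cols]
```

A call such as `panorama_to_bev(pano, bev_size=256, curve_k=0.01)` returns a `256 x 256` array with the same channel layout as `pano`. Nearest-neighbour sampling keeps the sketch short; bilinear interpolation would be the more usual choice in practice.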