AerialGo: Walking-through City View Generation from Aerial Perspectives

Fuqiang Zhao, Yijing Guo, Siyuan Yang, Xi Chen, Luo Wang, Lan Xu, Yingliang Zhang, Yujiao Shi, Jingyi Yu

TL;DR

AerialGo addresses the privacy and scalability bottlenecks of city-scale 3D reconstruction by generating realistic ground-view images from aerial data using a multi-view diffusion framework conditioned on aerial references and 3D priors. It introduces a diffusion-based Aerial2Ground generator and integrates generated priors into 3DGS backbones, yielding improved ground-view fidelity and structural coherence. The paper also presents the AerialGo dataset, a large-scale collection of 3.45 million aerial and ground-view images across 134 km^2 with depth and camera annotations to enable training and evaluation. Across extensive experiments, AerialGo demonstrates superior ground-level realism and occlusion handling, offering a privacy-preserving, scalable approach for city-scale 3D reconstruction and walk-through rendering.

Abstract

High-quality 3D urban reconstruction is essential for applications in urban planning, navigation, and AR/VR. However, capturing detailed ground-level data across cities is labor-intensive and raises significant privacy concerns related to sensitive information, such as vehicle plates, faces, and other personal identifiers. To address these challenges, we propose AerialGo, a novel framework that generates realistic walking-through city views from aerial images, leveraging multi-view diffusion models to achieve scalable, photorealistic urban reconstructions without direct ground-level data collection. By conditioning ground-view synthesis on accessible aerial data, AerialGo bypasses the privacy risks inherent in ground-level imagery. To support model training, we introduce the AerialGo dataset, a large-scale dataset of diverse aerial and ground-view images paired with camera and depth information, designed to support generative urban reconstruction. Experiments show that AerialGo significantly enhances ground-level realism and structural coherence, providing a privacy-conscious, scalable solution for city-scale 3D modeling.

Paper Structure

This paper contains 13 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of the AerialGo dataset and results. (a) The AerialGo dataset is a large-scale, multi-view, multi-attribute dataset encompassing aerial and ground perspectives. (b) Leveraging the AerialGo dataset, we introduce the AerialGo method, an innovative multi-view diffusion framework designed to synthesize photorealistic ground-level imagery from aerial observations, enabling enhanced urban scene reconstruction and realistic walkthrough experiences.
  • Figure 2: Overview of the dataset and data collection process. This figure showcases an example of our urban city model, highlighting the block partitioning, the design of aerial and ground trajectories, as well as the dynamic rendering capabilities.
  • Figure 3: Pipeline of the AerialGo method. Starting with a target ground view, we first select reference images from the nearest aerial views and encode them using a pretrained auto-encoder. The diffusion model then processes the encoded aerial features along with random noise at the ground view, passing the adapted features through 3D self-attention layers. Additionally, CLIP embeddings of the ground-view point cloud render are integrated via cross-attention layers to enhance structural consistency in the generated views. The resulting priors contribute to improved 3D urban reconstruction quality, especially at ground level.
  • Figure 4: Qualitative comparison of 3D reconstruction methods with or without our generated ground-view priors. * denotes that the method is implemented by ourselves.
  • Figure 5: Qualitative comparison of generative NVS methods on the AerialGo and MatrixCity datasets. Compared with MotionCtrl [wang2024motionctrl], LucidDreamer [chung2023luciddreamer], and ViewCrafter [yu2024viewcrafter], our results align well with the target image.
  • ...and 2 more figures
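
The reference-selection step described in the Figure 3 caption (choosing the aerial views nearest to a target ground view) can be illustrated with a minimal sketch. The caption does not state how "nearest" is measured, so this sketch assumes Euclidean distance between camera centers; the function name `nearest_aerial_views`, the parameter `k`, and the coordinates are hypothetical and not taken from the paper.

```python
import math


def nearest_aerial_views(ground_center, aerial_centers, k=3):
    """Return indices of the k aerial cameras closest to the ground camera.

    Assumption: "nearest" means Euclidean distance between camera centers.
    The paper summary does not specify the actual selection criterion.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank aerial cameras by distance to the target ground camera.
    ranked = sorted(
        range(len(aerial_centers)),
        key=lambda i: math.dist(ground_center, aerial_centers[i]),
    )
    return ranked[:k]


# Hypothetical camera centers (x, y, z) in metres; illustrative values only.
ground = (10.0, 5.0, 1.7)
aerial = [
    (0.0, 0.0, 120.0),
    (12.0, 6.0, 100.0),
    (200.0, 150.0, 110.0),
    (8.0, 4.0, 130.0),
]
selected = nearest_aerial_views(ground, aerial, k=2)
print(selected)
```

A real pipeline might also weigh viewing direction or visibility of the target region when ranking references; those refinements are left out here because the summary gives no detail on them.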