Skyeyes: Ground Roaming using Aerial View Images

Zhiyuan Gao, Wenbin Teng, Gonglin Chen, Jinsen Wu, Ningli Xu, Rongjun Qin, Andrew Feng, Yajie Zhao

TL;DR

Skyeyes addresses the challenge of generating photorealistic, temporally coherent ground-view sequences from aerial imagery for large-scale outdoor scenes. It introduces a three-component pipeline: (1) SuGaR-based 3D Gaussian Splatting to produce geometry-aware ground-view priors, (2) an appearance control module using latent diffusion with ControlNet to render realistic street views, and (3) a view consistency module that enforces spatial-temporal coherence across frames via a first-frame conditioned diffusion process. The framework is trained on a large synthetic geo-aligned dataset built from CARLA and CitySample, and its quantitative and qualitative results show significant improvements in video coherence (FVD/KVD) while maintaining competitive image quality compared with state-of-the-art baselines. This work enables scalable, cross-view synthesis for applications in autonomous driving and gaming, with practical impact in generating realistic, consistent ground scenes from aerial perspectives. Limitations include generalization to real-world data, which the authors plan to address by diversifying training textures and lighting in future work.
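
For readers unfamiliar with the video metrics named above: FVD is conventionally defined as the Fréchet distance between two Gaussians fitted to features of real and generated videos, and KVD is the kernel-based counterpart, computed as a maximum mean discrepancy over the same kind of features. The formula below is that standard definition, not something taken from the Skyeyes paper; the symbols are introduced here for illustration, with (μ_r, Σ_r) and (μ_g, Σ_g) denoting the feature mean and covariance of the real and generated sets. The feature extractor and evaluation settings used by the authors are not stated in this summary.

```latex
d^2\big((\mu_r,\Sigma_r),(\mu_g,\Sigma_g)\big)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\big)
```

Lower values mean the generated feature distribution is closer to the real one.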

Abstract

Integrating aerial imagery-based scene generation into applications like autonomous driving and gaming enhances realism in 3D environments, but challenges remain in creating detailed content for occluded areas and ensuring real-time, consistent rendering. In this paper, we introduce Skyeyes, a novel framework that can generate photorealistic sequences of ground view images using only aerial view inputs, thereby creating a ground roaming experience. More specifically, we combine a 3D representation with a view consistent generation model, which ensures coherence between generated images. This method allows for the creation of geometrically consistent ground view images, even with large view gaps. The images maintain improved spatial-temporal coherence and realism, enhancing scene comprehension and visualization from aerial perspectives. To the best of our knowledge, there are no publicly available datasets that contain pairwise geo-aligned aerial and ground view imagery. Therefore, we build a large, synthetic, and geo-aligned dataset using Unreal Engine. Both qualitative and quantitative analyses on this synthetic dataset display superior results compared to other leading synthesis approaches. See the project page for more results: https://chaoren2357.github.io/website-skyeyes/.

Paper Structure

This paper contains 28 sections, 10 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: We propose Skyeyes, a novel framework for efficient aerial-to-ground cross-view synthesis that transforms aerial imagery into realistic street view image sequences. This first-of-its-kind method for large-scale outdoor scenes combines 3D Gaussian Splatting with diffusion models to identify data gaps. Our constrained optimization strategy and View Consistent Module let us synthesize images from perspectives entirely different from those of the input imagery, significantly enhancing the quality of ground-level view synthesis.
  • Figure 2: (a) Overview of the Skyeyes pipeline: Our approach begins with SuGaR (Guédon and Lepetit, 2023), which is trained on aerial images and camera poses to generate ground view priors. We then train an appearance control module to generate photorealistic street images. (b) Spatial-Temporal Self-Attention Module: In the final stage, our view consistency module integrates temporal modeling to ensure spatial and temporal coherence across different views. This module, akin to a spatial-temporal self-attention mechanism, keeps the depiction of the scene consistent and continuous across perspectives. At inference time, given a sequence of ground view priors rendered from SuGaR, the view consistency module generates a photorealistic and temporally consistent ground view sequence by denoising from pure Gaussian noise.
  • Figure 3: Comprehensive visual representation of the data extraction process from CARLA and the City Sample project.
  • Figure 4: Qualitative Results. Conditioned on aerial images (leftmost column), our method synthesizes realistic and view-consistent ground view sequences. The first two rows are from the CitySample dataset, and the last two from the CARLA dataset. We strongly recommend checking the supplementary material for more results.
  • Figure 5: Qualitative Comparisons. We compare Skyeyes with other SOTA methods for ground view generation. Unlike tasks that require matching ground truth, our task focuses on generating visually plausible images with continuous textures. All methods were evaluated under the same conditions, and Skyeyes consistently delivers superior visual quality.
  • ...and 2 more figures
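
The Figure 2(b) caption describes the view consistency module as akin to spatial-temporal self-attention, and the TL;DR describes it as first-frame conditioned. One common way to realize that idea is to let every frame's queries attend to the tokens of the first frame in addition to its own. The sketch below illustrates this under stated assumptions: the function name `first_frame_attention`, the single-head layout, the projection matrices, and the choice of "first frame plus current frame" as context are all hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def first_frame_attention(tokens, w_q, w_k, w_v):
    """Single-head attention where each frame attends to frame 0 and itself.

    tokens: array of shape (frames, n_tokens, dim), e.g. flattened latents.
    w_q, w_k, w_v: projection matrices of shape (dim, dim_head).
    Returns an array of shape (frames, n_tokens, dim_head).

    Assumption (not from the paper): the attention context of every frame
    is the first frame's tokens concatenated with that frame's own tokens.
    """
    q = tokens @ w_q  # (F, N, Dh)

    # Repeat the first frame for every frame, then append each frame's own tokens.
    first = np.broadcast_to(tokens[:1], tokens.shape)   # (F, N, D)
    context = np.concatenate([first, tokens], axis=1)   # (F, 2N, D)

    k = context @ w_k  # (F, 2N, Dh)
    v = context @ w_v  # (F, 2N, Dh)

    # Scaled dot-product attention over the 2N context tokens.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(k.shape[-1])  # (F, N, 2N)
    weights = softmax(scores, axis=-1)
    return weights @ v


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 6, 8))  # 4 frames, 6 tokens each, 8 channels
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = first_frame_attention(tokens, w_q, w_k, w_v)
print(out.shape)  # (4, 6, 8)
```

In this sketch every frame shares the first frame as an anchor, so a change to frame 0 affects all outputs while a change to any other frame affects only that frame. This is one simple way to encourage consistent appearance across a sequence; the actual module may use a different context or additional temporal layers.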