Skyeyes: Ground Roaming using Aerial View Images
Zhiyuan Gao, Wenbin Teng, Gonglin Chen, Jinsen Wu, Ningli Xu, Rongjun Qin, Andrew Feng, Yajie Zhao
TL;DR
Skyeyes addresses the challenge of generating photorealistic, temporally coherent ground-view sequences from aerial imagery of large-scale outdoor scenes. It introduces a three-component pipeline: (1) SuGaR-based 3D Gaussian Splatting to produce geometry-aware ground-view priors, (2) an appearance control module that uses latent diffusion with ControlNet to render realistic street views, and (3) a view consistency module that enforces spatial-temporal coherence across frames via a first-frame-conditioned diffusion process. The framework is trained on a large synthetic, geo-aligned dataset built from CARLA and CitySample. Quantitative and qualitative results show significant improvements in video coherence (FVD/KVD) while maintaining competitive image quality relative to state-of-the-art baselines. The work enables scalable cross-view synthesis for applications such as autonomous driving and gaming. Its main limitation is generalization to real-world data, which the authors plan to address in future work by diversifying training textures and lighting.
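The data flow between the three components can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the class name, the three callables, and the assumption that frames are produced one pose at a time with the first generated frame reused as the reference are all hypothetical, and the toy stand-ins operate on plain numbers rather than images.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence


@dataclass
class GroundRoamingPipeline:
    """Hypothetical wiring of the three stages named in the summary."""

    # Stage 1 (assumed interface): 3D representation + camera pose -> coarse ground-view prior.
    render_prior: Callable[[Any, Any], Any]
    # Stage 2 (assumed interface): coarse prior -> realistic-looking frame.
    refine_appearance: Callable[[Any], Any]
    # Stage 3 (assumed interface): (frame, first frame) -> frame made consistent with the first.
    enforce_consistency: Callable[[Any, Any], Any]

    def generate(self, scene: Any, camera_poses: Sequence[Any]) -> List[Any]:
        frames: List[Any] = []
        first_frame: Optional[Any] = None
        for pose in camera_poses:
            prior = self.render_prior(scene, pose)
            frame = self.refine_appearance(prior)
            if first_frame is None:
                # The first frame becomes the reference for all later frames.
                first_frame = frame
            else:
                frame = self.enforce_consistency(frame, first_frame)
            frames.append(frame)
        return frames


# Toy stand-ins so the sketch runs end to end; real stages would be
# a Gaussian Splatting renderer and diffusion models.
pipeline = GroundRoamingPipeline(
    render_prior=lambda scene, pose: float(pose),
    refine_appearance=lambda prior: prior + 0.5,
    enforce_consistency=lambda frame, first: 0.5 * (frame + first),
)
frames = pipeline.generate(scene=None, camera_poses=[0, 2, 4])
print(frames)
```

Passing the stages in as callables keeps the sketch independent of any particular renderer or diffusion library; only the ordering of the stages and the reuse of the first frame are being illustrated.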
Abstract
Integrating aerial-imagery-based scene generation into applications like autonomous driving and gaming enhances realism in 3D environments, but challenges remain in creating detailed content for occluded areas and ensuring real-time, consistent rendering. In this paper, we introduce Skyeyes, a novel framework that can generate photorealistic sequences of ground-view images using only aerial-view inputs, thereby creating a ground roaming experience. More specifically, we combine a 3D representation with a view-consistent generation model, which ensures coherence between generated images. This method allows for the creation of geometrically consistent ground-view images, even with large view gaps. The images maintain improved spatial-temporal coherence and realism, enhancing scene comprehension and visualization from aerial perspectives. To the best of our knowledge, there are no publicly available datasets that contain pairwise geo-aligned aerial and ground-view imagery. Therefore, we build a large, synthetic, geo-aligned dataset using Unreal Engine. Both qualitative and quantitative analyses on this synthetic dataset show superior results compared to other leading synthesis approaches. See the project page for more results: https://chaoren2357.github.io/website-skyeyes/.
