Top2Ground: A Height-Aware Dual Conditioning Diffusion Model for Robust Aerial-to-Ground View Generation
Jae Joong Lee, Bedrich Benes
TL;DR
Top2Ground addresses cross-view synthesis: it generates ground-level images from aerial inputs using a diffusion model conditioned on height-aware geometry and semantic context. Its height-aware dual conditioning fuses VAE-based spatial features from the aerial image and its height map with CLIP-based semantic embeddings, enabling direct synthesis without 3D intermediates. On CVUSA, CVACT, and Auto Arborist, it achieves state-of-the-art performance with notable gains in SSIM and KID, demonstrating robust generalization across wide and narrow FOVs. The approach provides a scalable, efficient foundation for cross-view generation, with potential extensions to other modalities and to temporal consistency.
Abstract
Generating ground-level images from aerial views is a challenging task due to extreme viewpoint disparity, occlusions, and a limited field of view. We introduce Top2Ground, a novel diffusion-based method that directly generates photorealistic ground-view images from aerial input images without relying on intermediate representations such as depth maps or 3D voxels. Specifically, we condition the denoising process on a joint representation of VAE-encoded spatial features (derived from aerial RGB images and an estimated height map) and CLIP-based semantic embeddings. This design ensures that generation is both geometrically constrained by the scene's 3D structure and semantically consistent with its content. We evaluate Top2Ground on three diverse datasets: CVUSA, CVACT, and Auto Arborist. Our approach achieves a 7.3% average improvement in SSIM across the three datasets and robustly handles both wide and narrow fields of view, highlighting its strong generalization capabilities.
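To make the dual-conditioning idea concrete, here is a minimal NumPy sketch of how the two conditioning signals might be assembled for one denoising step. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the fusion operator, so channel-wise concatenation of the latents is assumed here. The helpers `toy_vae_encode`, `toy_clip_embed`, and `build_conditioning` are hypothetical stand-ins for the real pretrained encoders, and the tensor shapes (4 latent channels, 8x downsampling, a single 768-dimensional semantic token) are also assumptions.

```python
import numpy as np


def toy_vae_encode(image, latent_channels=4, factor=8):
    """Hypothetical stand-in for a VAE encoder (assumption, not the real model).

    Average-pools the image by `factor`, then mixes channels with a fixed
    random projection to produce a (latent_channels, H/factor, W/factor) map.
    """
    c, h, w = image.shape
    pooled = image.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))
    proj = np.random.default_rng(c).standard_normal((latent_channels, c))
    return np.einsum("oc,chw->ohw", proj, pooled)


def toy_clip_embed(image, dim=768):
    """Hypothetical stand-in for a CLIP image encoder: one unit-norm token."""
    feats = image.mean(axis=(1, 2))  # per-channel global average, shape (c,)
    proj = np.random.default_rng(1).standard_normal((dim, feats.shape[0]))
    vec = proj @ feats
    return (vec / np.linalg.norm(vec))[None, :]  # shape (1, dim)


def build_conditioning(noisy_latent, aerial_rgb, height_map):
    """Assemble spatial and semantic conditioning for one denoising step."""
    z_rgb = toy_vae_encode(aerial_rgb)    # spatial features of the aerial image
    z_height = toy_vae_encode(height_map)  # spatial features of the height map
    # Assumed fusion: stack everything along the channel axis.
    spatial = np.concatenate([noisy_latent, z_rgb, z_height], axis=0)
    # Semantic context, e.g. for a cross-attention input in the denoiser.
    semantic = toy_clip_embed(aerial_rgb)
    return spatial, semantic


rng = np.random.default_rng(0)
aerial = rng.random((3, 256, 256))        # aerial RGB image
height = rng.random((1, 256, 256))        # estimated height map
noisy = rng.standard_normal((4, 32, 32))  # noisy ground-view latent

spatial, semantic = build_conditioning(noisy, aerial, height)
print(spatial.shape, semantic.shape)  # (12, 32, 32) (1, 768)
```

In this sketch the spatial tensor carries the geometric constraint (aerial appearance plus height) at the same resolution as the noisy latent, while the semantic token carries the global scene content. A real system would replace both toy encoders with pretrained networks and would need to handle ground-view panoramas whose aspect ratio differs from that of the aerial input.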
