Top2Ground: A Height-Aware Dual Conditioning Diffusion Model for Robust Aerial-to-Ground View Generation

Jae Joong Lee, Bedrich Benes

TL;DR

Top2Ground addresses aerial-to-ground view synthesis: it generates ground-level images from aerial inputs using a diffusion model conditioned on height-aware geometry and semantic context. Its height-aware dual conditioning fuses VAE-based spatial features from the aerial image and its height map with CLIP-based semantic embeddings, enabling direct synthesis without 3D intermediates. On CVUSA, CVACT, and Auto Arborist, it achieves state-of-the-art performance with gains in SSIM (7.3% on average across the three datasets) and KID, demonstrating robust generalization across wide and narrow FOVs. The approach provides a scalable, efficient foundation for cross-view generation, with potential extensions to other modalities and to temporal consistency.

Abstract

Generating ground-level images from aerial views is a challenging task due to extreme viewpoint disparity, occlusions, and a limited field of view. We introduce Top2Ground, a novel diffusion-based method that directly generates photorealistic ground-view images from aerial input images without relying on intermediate representations such as depth maps or 3D voxels. Specifically, we condition the denoising process on a joint representation of VAE-encoded spatial features (derived from aerial RGB images and an estimated height map) and CLIP-based semantic embeddings. This design ensures the generation is both geometrically constrained by the scene's 3D structure and semantically consistent with its content. We evaluate Top2Ground on three diverse datasets: CVUSA, CVACT, and Auto Arborist. Our approach achieves a 7.3% average improvement in SSIM across the three benchmark datasets and robustly handles both wide and narrow fields of view, highlighting its strong generalization capabilities.
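The abstract does not say how the spatial features enter the denoiser. As a minimal sketch, assuming channel-wise concatenation of the noisy latent with the VAE latents of the aerial image and the height map, the dual conditioning inputs could be assembled as below. All names (`gaussian_latent`, `constant_latent`, `concat_channels`), the 4x8x8 latent shape, and the 5x16 CLIP token shape are hypothetical placeholders, not the authors' implementation; real encoders are replaced by constant stand-ins.

```python
import random

# Hypothetical sketch: shapes, names, and the concatenation strategy are
# assumptions for illustration, not the authors' implementation.

def gaussian_latent(channels, height, width, rng):
    """Return a channels x height x width nested list of standard-normal samples."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(width)]
             for _ in range(height)]
            for _ in range(channels)]

def constant_latent(channels, height, width, value):
    """Return a constant latent; stands in for a real encoder output."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]

def concat_channels(*latents):
    """Concatenate latents along the channel axis after checking spatial sizes."""
    height, width = len(latents[0][0]), len(latents[0][0][0])
    for latent in latents:
        if len(latent[0]) != height or len(latent[0][0]) != width:
            raise ValueError("all latents must share the same spatial size")
    return [channel for latent in latents for channel in latent]

rng = random.Random(0)
noisy_latent = gaussian_latent(4, 8, 8, rng)      # stands in for the noisy latent
aerial_latent = constant_latent(4, 8, 8, 0.5)     # stands in for the VAE encoding of the aerial image
height_latent = constant_latent(4, 8, 8, 0.1)     # stands in for the VAE encoding of the height map
clip_tokens = [[0.0] * 16 for _ in range(5)]      # stands in for the CLIP embedding of the aerial image

# Spatial conditioning: stacked with the noisy latent along the channel axis.
denoiser_input = concat_channels(noisy_latent, aerial_latent, height_latent)
# Semantic conditioning: kept separate, to be consumed by cross-attention.
conditioning = {"spatial_input": denoiser_input, "context": clip_tokens}

print(len(denoiser_input), len(denoiser_input[0]), len(denoiser_input[0][0]))  # 12 8 8
```

Keeping the two streams separate mirrors the split described above: the spatial latents constrain layout at every pixel location, while the CLIP tokens act as a global context for the denoiser.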

Paper Structure

This paper contains 14 sections, 3 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Top2Ground takes an aerial RGB image, $x$, and generates an estimated height map, $H(x)$. The image $x$ is passed to the pre-trained CLIP encoder $\mathbb{C}$, and both $x$ and $H(x)$ are passed to the pre-trained VAE $\mathbb{V}$, yielding the semantic embedding $\mathbb{C}(x)$ and the structural embedding $\mathbb{V}(x)$, respectively. $\mathbb{V}(x)$ is merged with Gaussian noise $z_t$ and fed into a latent diffusion model, $f_\theta$. During the diffusion process, cross-attention conditioned on $\mathbb{C}(x)$ provides semantic consistency. With classifier-free guidance at a scale of 2, the model generates a high-quality RGB ground-level image, $y$.
  • Figure 2: Qualitative comparison of generated ground-level images on the CVUSA dataset. We compare our method with ControlNet (CntrlNet), InstructPix2Pix (Inst P2P), BBDM, and Sat2Density (S2D). Our model better preserves structural layout and semantic coherence, demonstrating improved fidelity and realism over prior approaches.
  • Figure 3: Qualitative comparison of generated ground-level images on the CVACT dataset. We compare our method with ControlNet (CntrlNet), InstructPix2Pix (Inst P2P), BBDM, and Sat2Density (S2D).
  • Figure 4: Qualitative comparison of generated ground-level images on the Auto Arborist dataset. We compare our method with ControlNet (CntrlNet), InstructPix2Pix (Inst P2P), BBDM, and Sat2Density (S2D).
  • Figure 5: Effect of removing height map conditioning. Without the height map, the model produces distorted ground-view images with degraded structural fidelity and incorrect object placements, highlighting the importance of spatial context for accurate synthesis.
  • ...and 1 more figure
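
The classifier-free guidance step mentioned in the Figure 1 caption can be illustrated with the standard guidance formula, which extrapolates from an unconditional noise prediction toward a conditional one. The sketch below uses flat lists in place of tensors. The function name `classifier_free_guidance` is hypothetical, and which conditioning signal is dropped for the unconditional branch is not stated in the caption, so the two predictions here are arbitrary example values.

```python
# Hypothetical sketch of the standard classifier-free guidance blend.
# The example predictions are made-up numbers, not model outputs.

def classifier_free_guidance(eps_uncond, eps_cond, scale=2.0):
    """Blend unconditional and conditional noise predictions element-wise.

    scale = 1.0 returns the conditional prediction unchanged; larger values
    push the result further in the direction of the conditioning.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0, -1.0]   # prediction with conditioning dropped
eps_cond = [1.0, 1.0, 0.0]      # prediction with conditioning applied

guided = classifier_free_guidance(eps_uncond, eps_cond, scale=2.0)
print(guided)  # [2.0, 1.0, 1.0]
```

With a scale of 2, each element moves twice the distance from the unconditional to the conditional prediction, so elements where the two predictions agree are left unchanged.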