ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction

Sirshapan Mitra, Yogesh S. Rawat

Abstract

Generating ground-level views and coherent 3D site models from aerial-only imagery is challenging due to extreme viewpoint changes, missing intermediate observations, and large scale variations. Existing methods either refine renderings post-hoc, often producing geometrically inconsistent results, or rely on multi-altitude ground truth, which is rarely available. Gaussian Splatting and diffusion-based refinements improve fidelity under small variations but fail under wide aerial-to-ground gaps. To address these limitations, we introduce ProDiG (Progressive Diffusion-Guided Gaussian Splatting), a framework that progressively transforms aerial 3D representations toward ground-level fidelity. ProDiG synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion. A distance-adaptive Gaussian module dynamically adjusts Gaussian scale and opacity based on camera distance, ensuring stable reconstruction across large viewpoint gaps. Together, these components enable progressive, geometrically grounded refinement without requiring additional ground-truth viewpoints. Extensive experiments on synthetic and real-world datasets demonstrate that ProDiG produces visually realistic ground-level renderings and coherent 3D geometry, significantly outperforming existing approaches in visual quality, geometric consistency, and robustness to extreme viewpoint changes.

Paper Structure

This paper contains 13 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview of ProDiG. (a) Our framework reconstructs a complete 3D scene using only aerial images. A large distribution shift exists between the aerial training images and the ground-level query images. During evaluation, we render novel views at ground-level camera poses and compare them against ground-truth images. (b) In our Distance-Adaptive Gaussian Splatting module, each Gaussian is dynamically scaled and reweighted by a lightweight encoder that predicts adjustment factors from its learned scaling feature and its distance to the active camera (a code sketch follows this list). (c) We progressively render noisy novel views at successively lower altitudes, fix these views with our diffusion model, and iteratively retrain the Gaussian Splatting model on the fixed novel views (see the loop sketch after this list).
  • Figure 2: Overview of aeroFix: (left) Our diffusion model is fine-tuned on aerial imagery using LoRA. The noisy novel view is refined using the reference view to produce the fixed novel image. In the diffusion block, the relative camera pose difference is injected into the timestep embedding of the noisy image to encode geometric variation across viewpoints. We additionally include Plücker ray embeddings before the attention mixing layer to provide geometric cues. In the Causal Attention Mixing module, we enforce an epipolar constraint by masking the novel-query to reference-key attention map so that only tokens aligned with the corresponding epipolar lines retain attention (value 1), while all others are suppressed (value 0); a mask sketch follows this list. The reference-query to novel-key block is fully masked, and the remaining attention blocks operate under standard full attention. (right) The figure illustrates epipolar correspondences for multiple query points on the noisy image and their corresponding lines on the reference view.
  • Figure 3: Effectiveness of aeroFix: Comparison of aerial image refinement between Difix [difix] and our aeroFix model. The noisy novel views are outlined in orange, the reference images in green, and the refined (fixed) novel images in pink. Difix tends to copy content from the reference view when the viewpoint difference is large, leading to inconsistencies and artifacts. In contrast, aeroFix preserves structural fidelity and produces geometrically consistent, artifact-free refinements.
  • Figure 4: Qualitative analysis of ProDiG (ours): Comparison of our method with existing baselines on aerial-to-ground reconstruction. Gaussian Splatting [3dgs] struggles to render complete scenes due to the absence of ground-level viewpoints, while Difix+ [difix] exhibits noisy artifacts and hallucinated structures. In contrast, ProDiG (ours) produces geometrically consistent and visually coherent reconstructions with fewer hallucinations. Notably, in the second row, the aerial inputs and reconstructed model include cars visible from above, whereas the ground-truth image, captured at a different time, does not.
  • Figure 5: Generalization across Varying Altitudes. We evaluate our method on the Aerial MegaDepth [vuong2025aerialmegadepth] dataset, which contains sites captured at diverse altitude ranges.
  • ...and 1 more figure
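
To make Figure 1(b) concrete, here is a minimal PyTorch sketch of a distance-adaptive module under one plausible reading: a lightweight MLP maps each Gaussian's learned scaling feature, concatenated with its distance to the active camera, to multiplicative scale and opacity factors. The class name, feature widths, and the softplus/sigmoid activations are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DistanceAdaptiveModule(nn.Module):
    """Hypothetical distance-adaptive module: predicts per-Gaussian scale and
    opacity adjustment factors from the learned scaling feature and the
    Gaussian-to-camera distance (one reading of Figure 1(b))."""

    def __init__(self, feat_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        # Lightweight encoder: [scaling feature, distance] -> 2 factors.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim + 1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, scale_feat, means, scales, opacity, cam_pos):
        # scale_feat: (N, feat_dim) learned scaling features
        # means: (N, 3) Gaussian centers; cam_pos: (3,) active camera center
        dist = torch.linalg.norm(means - cam_pos, dim=-1, keepdim=True)  # (N, 1)
        factors = self.encoder(torch.cat([scale_feat, dist], dim=-1))    # (N, 2)
        scale_adj = nn.functional.softplus(factors[:, :1])   # positive multiplier
        opacity_adj = torch.sigmoid(factors[:, 1:])          # in (0, 1)
        return scales * scale_adj, opacity * opacity_adj
```

Because the adjustments are multiplicative and recomputed per camera, the same set of Gaussians can be rendered differently at each viewpoint without permanently overwriting its base parameters.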
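The progressive loop in Figure 1(c) then alternates rendering, diffusion-based fixing, and Gaussian retraining. The schematic below passes the pipeline stages in as callables; sample_poses, render_views, pick_reference_views, aerofix_refine, and finetune_gaussians are hypothetical placeholders for those stages, not functions from the paper.

```python
from typing import Callable, List, Sequence

def progressive_refinement(
    gaussians,
    aerial_views: List,
    altitudes: Sequence[float],
    sample_poses: Callable,
    render_views: Callable,
    pick_reference_views: Callable,
    aerofix_refine: Callable,
    finetune_gaussians: Callable,
):
    """Progressive altitude loop of Figure 1(c): render noisy views at
    successively lower altitudes, fix them with the diffusion model, and
    retrain the Gaussians on the growing training set."""
    train_views = list(aerial_views)
    for altitude in altitudes:                      # ordered high -> low
        poses = sample_poses(altitude)              # intermediate-altitude cameras
        noisy = render_views(gaussians, poses)      # noisy novel renderings
        refs = pick_reference_views(train_views, poses)
        fixed = aerofix_refine(noisy, refs, poses)  # diffusion-based fixing
        train_views.extend(fixed)                   # fixed views act as pseudo GT
        gaussians = finetune_gaussians(gaussians, train_views)
    return gaussians
```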
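Finally, the epipolar masking in the Causal Attention Mixing module (Figure 2) can be pictured as a binary attention mask over concatenated novel and reference tokens: a novel query may attend to a reference key only if that key's pixel lies near the query's epipolar line, reference queries never attend to novel keys, and the remaining blocks use full attention. The sketch below assumes a known fundamental matrix F_mat mapping novel-view pixels to reference-view epipolar lines, tokens laid out as [novel; reference] on an h x w grid, and a hand-picked pixel threshold; all of these are assumptions for illustration.

```python
import torch

def epipolar_attention_mask(F_mat, h, w, threshold=1.5):
    """(h*w, h*w) bool mask: entry (i, j) is True iff reference token j lies
    within `threshold` pixels of novel token i's epipolar line."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pts = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3).float()
    lines = pts @ F_mat.T                        # one epipolar line per novel token
    num = (lines @ pts.T).abs()                  # |l_i . q_j| for every token pair
    denom = torch.linalg.norm(lines[:, :2], dim=-1, keepdim=True).clamp_min(1e-8)
    return (num / denom) < threshold             # point-to-line distance test

def causal_mixing_mask(F_mat, h, w, threshold=1.5):
    """Full (2n, 2n) mask over [novel; reference] tokens; True = may attend."""
    n = h * w
    mask = torch.ones(2 * n, 2 * n, dtype=torch.bool)   # default: full attention
    mask[:n, n:] = epipolar_attention_mask(F_mat, h, w, threshold)  # novel Q -> ref K
    mask[n:, :n] = False                                # ref Q -> novel K: fully masked
    return mask
```

The resulting boolean mask can be applied inside standard scaled-dot-product attention by setting masked-out logits to negative infinity before the softmax.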