DRAGON: Drone and Ground Gaussian Splatting for 3D Building Reconstruction

Yujin Ham, Mateusz Michalkiewicz, Guha Balakrishnan

TL;DR

DRAGON tackles 3D building reconstruction from drone and near-ground imagery by introducing an iterative extrapolation scheme that generates intermediate-elevation views to bridge the gap in viewpoint coverage between the two elevations. It couples 3D Gaussian Splatting (3DGS) with perceptual regularization from DreamSim and OpenCLIP to stabilize extrapolation and enable registration, achieving near-perfect drone-ground pose alignment on a new Buildings-NVS dataset. The approach yields compelling renderings across elevations and approaches oracle performance, while also exposing the limitations of semi-synthetic data and the potential for artifacts from perceptual losses. Overall, DRAGON offers a practical pathway to large-scale building modeling from widely accessible imagery without explicit camera poses per view.

Abstract

3D building reconstruction from imaging data is an important task for many applications ranging from urban planning to reconnaissance. Modern novel view synthesis (NVS) methods like NeRF and Gaussian Splatting offer powerful techniques for developing 3D models from natural 2D imagery in an unsupervised fashion. These algorithms generally require input training views surrounding the scene of interest, which, in the case of large buildings, are typically not available across all camera elevations. In particular, the most readily available camera viewpoints at scale across most buildings are at near-ground (e.g., with mobile phones) and aerial (drones) elevations. However, due to the significant difference in viewpoint between drone and ground image sets, camera registration, a necessary step for NVS algorithms, fails. In this work we propose a method, DRAGON, that can take drone and ground building imagery as input and produce a 3D NVS model. The key insight of DRAGON is that intermediate-elevation imagery may be extrapolated by an NVS algorithm itself in an iterative procedure with perceptual regularization, thereby bridging the visual feature gap between the two elevations and enabling registration. We compiled a semi-synthetic dataset of 9 large building scenes using Google Earth Studio, and quantitatively and qualitatively demonstrate that DRAGON can generate compelling renderings on this dataset compared to baseline strategies.
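
The iterative procedure described in the abstract can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `register`, `train`, and `render_at` are hypothetical callables standing in for camera registration (e.g., COLMAP), 3DGS training, and rendering at a target elevation. The integer elevation indexing and the way ground views are merged in the final step are assumptions made for the example.

```python
from typing import Any, Callable, List, Sequence, Tuple


def bridge_elevations(
    drone_views: Sequence[Any],
    ground_views: Sequence[Any],
    num_elevations: int,
    register: Callable[[List[Any]], Any],
    train: Callable[[List[Any], Any], Any],
    render_at: Callable[[Any, int], List[Any]],
) -> Tuple[Any, List[Any]]:
    """Descend from the drone elevation toward the ground, one level at a time.

    Elevations are indexed num_elevations (drone) down to 1 (ground).
    All three callables are hypothetical placeholders supplied by the caller.
    """
    accumulated = list(drone_views)  # start with the highest elevation only

    # Intermediate levels: num_elevations - 1, ..., 2.
    for level in range(num_elevations - 1, 1, -1):
        poses = register(accumulated)        # poses for all views so far
        model = train(accumulated, poses)    # fit the NVS model on those views
        # Extrapolate one level lower and keep the renders as new inputs.
        accumulated.extend(render_at(model, level))

    # Assumption: real ground images are added once the gap has been bridged.
    accumulated.extend(ground_views)
    poses = register(accumulated)
    model = train(accumulated, poses)
    return model, accumulated
```

With five elevations, as in the Buildings-NVS test split, the loop renders three intermediate levels before the ground images are registered together with everything accumulated so far.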

Paper Structure

This paper contains 28 sections, 4 equations, 9 figures, 5 tables, 2 algorithms.

Figures (9)

  • Figure 1: Buildings-NVS, a dataset of images we collected from Google Earth Studio. We show views of three large buildings from drone-level (top) and ground-level (bottom) elevations. These opposing elevations offer highly contrasting viewpoints of the physical structures, which can provide complementary information for 3D modeling. However, the lack of easily matched visual features across elevations inhibits registration, a key step in novel view synthesis algorithms like NeRF (Mildenhall et al., 2020) and 3D Gaussian Splatting (Kerbl et al., 2023).
  • Figure 2: Visual depictions of the 9 buildings in our Buildings-NVS dataset. The buildings differ in height, architectural style, and background.
  • Figure 3: Train/test split of images for a given scene (Eiffel Tower) from our proposed Buildings-NVS dataset. (Left) Training data covers only drone and ground elevations. (Right) Testing data covers all five elevations, including drone and ground (due to space constraints, this figure shows sample images from the three middle elevations only).
  • Figure 4: Overview of DRAGON's iterative pipeline. (Left) We iteratively render viewpoints starting from aerial elevations towards ground. Given footage from elevations $X^N, \cdots, X^k$, we generate views for elevation $X^{k-1}$. (Right) We feed the accumulated views into COLMAP to obtain the corresponding camera poses and point cloud, which are then passed to the 3DGS model. The model is trained using pixel-wise and perception-wise losses (the DRAGON loss equation). Blue denotes training images, while green denotes images rendered for the next elevation.
  • Figure 5: DreamSim is better than LPIPS at prioritizing perceptual image quality over geometric changes. For images whose geometry differs from the reference, (c) tilted and (d) rotated, the LPIPS score increases relative to image (b), which shares the reference geometry, whereas the DreamSim score decreases; that is, DreamSim judges images (c) and (d) to be more similar to the reference than image (b). This suggests that DreamSim emphasizes perceptual image quality over geometric differences (spatial displacement), as long as the assessed image shares content with the reference. Because the perceptual distance measured by DreamSim is more robust to geometry changes than LPIPS, it is better suited as an auxiliary regularization loss.
  • ...and 4 more figures
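
The captions for Figures 4 and 5 describe training with a pixel-wise loss plus a perceptual regularizer. The sketch below shows one way such a combination could look. It is an assumption-laden illustration rather than the paper's loss equation, which is not reproduced in this section: `embed` is a hypothetical stand-in for a feature extractor such as DreamSim or OpenCLIP, the perceptual term is a plain cosine distance between embeddings, and `weight` is an arbitrary illustrative value.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def l1_pixel_loss(rendered: Vector, target: Vector) -> float:
    """Mean absolute difference between two flattened images of equal size."""
    if len(rendered) != len(target) or not rendered:
        raise ValueError("images must be non-empty and the same size")
    return sum(abs(r - t) for r, t in zip(rendered, target)) / len(rendered)


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def combined_loss(
    rendered: Vector,
    target: Vector,
    embed: Callable[[Vector], Vector],
    weight: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Pixel-wise term plus a weighted embedding-space (perceptual) term."""
    pixel = l1_pixel_loss(rendered, target)
    perceptual = cosine_distance(embed(rendered), embed(target))
    return pixel + weight * perceptual
```

Because the perceptual term compares embeddings rather than pixels, it can stay small when the rendered view is spatially displaced but shares content with the reference, which is the property Figure 5 attributes to DreamSim.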