
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis

Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, Shubham Tulsiani

TL;DR

The paper tackles cross-view aerial-ground reconstruction where extreme viewpoint changes hinder learning-based methods. It introduces AerialMegaDepth, a scalable hybrid dataset that combines pseudo-synthetic renderings from 3D city meshes with real, crowd-sourced ground images, co-registered into a shared coordinate frame. Fine-tuning state-of-the-art reconstruction and view-synthesis models on this data yields substantial improvements in cross-view pose estimation, 3D reconstruction, and novel-view synthesis for aerial-ground pairs. The work demonstrates that leveraging geospatial platforms and crowd-sourced imagery can dramatically expand cross-view training data, enabling more robust large-scale aerial-ground 3D modeling.

Abstract

We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 5% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 56%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, our dataset also improves performance on downstream tasks like novel-view synthesis in challenging aerial-ground scenarios, demonstrating the practical value of our approach in real-world applications.
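The rotation-accuracy figure quoted above (the fraction of aerial-ground pairs whose camera rotation error is within 5 degrees) can be illustrated with a small sketch. This is not the authors' evaluation code: the function names `rotation_error_deg` and `accuracy_at_threshold`, the helper `rot_z`, and the toy rotations are all assumptions made for illustration, and the paper's exact protocol (for example, how relative poses are formed per pair) is not specified in this section. The sketch uses the standard geodesic angle between two rotation matrices.

```python
import numpy as np


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    R_rel = R_pred @ R_gt.T
    # For a rotation by angle theta, trace(R) = 1 + 2*cos(theta).
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))


def accuracy_at_threshold(errors_deg, threshold_deg=5.0):
    """Fraction of pairs whose rotation error is within the threshold."""
    errors = np.asarray(errors_deg, dtype=float)
    return float(np.mean(errors <= threshold_deg))


def rot_z(deg):
    """Hypothetical helper: rotation about the z-axis by `deg` degrees."""
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


if __name__ == "__main__":
    # Toy data (assumed): ground truth is the identity, predictions are
    # off by 2, 10, 4, and 90 degrees respectively.
    R_gt = np.eye(3)
    preds = [rot_z(a) for a in (2.0, 10.0, 4.0, 90.0)]
    errors = [rotation_error_deg(R, R_gt) for R in preds]
    print("accuracy @ 5 deg:", accuracy_at_threshold(errors, 5.0))
```

In this toy setup two of the four predictions fall within the 5-degree threshold, so the reported accuracy is one half; the abstract's numbers correspond to the same kind of ratio computed over real aerial-ground pairs.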

Paper Structure

This paper contains 10 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: First row: Examples of our generated cross-view (aerial-ground) geometry data, including co-registered pseudo-synthetic (i.e., mesh-rendered) aerial and real ground-level images, with corresponding depth maps, point clouds, and camera intrinsics/extrinsics in a unified coordinate system, for a variety of scenes. Second row: Leveraging such data curated over 137 landmarks and 132K geo-registered images, we show significant improvements in learning-based methods on real unseen ground-aerial scenarios across two representative tasks: 1) multi-view geometry prediction using DUSt3R fine-tuned on our data, and 2) novel view synthesis from a single image conditioned on a target pose by fine-tuning ZeroNVS, which was originally trained on MegaScenes.
  • Figure 2: Overview of the data generation framework. To address the challenges of ground-aerial camera registration and novel-view synthesis, we propose a flexible framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level images (e.g., MegaDepth). The pseudo-synthetic data is captured at varying altitudes, while the real, crowd-sourced images help improve visual fidelity, especially for ground-level images where mesh-based renderings lack detail. The pipeline generates pseudo-synthetic images from different altitudes, co-registers them with real images, and aligns ground-level images with aerial data for 3D reconstruction. This hybrid dataset of real and pseudo-synthetic images provides geometric supervision that helps improve performance on downstream tasks such as ground-aerial camera registration and novel view synthesis, particularly in ground-aerial settings.
  • Figure 3: Feature matching between real and pseudo-synthetic images. The pseudo-synthetic rendering has a noticeable domain gap compared to the real MegaDepth image (e.g., no transients, simplistic lighting) but still enables reliable feature matching (SuperGlue) to register real images into the pseudo-synthetic reconstruction.
  • Figure 4: AerialMegaDepth data (top: MegaDepth, bottom: Google Earth) features diverse viewpoints & lighting conditions.
  • Figure 5: Zero-shot ground-aerial camera and geometry prediction results. Given two input images, one aerial and one ground, we compare the performance of the baseline DUSt3R with the model fine-tuned on our varying-altitude data. The results demonstrate significant improvements over the baseline in unseen, challenging ground-aerial scenarios, showing the effectiveness of fine-tuning DUSt3R with our data. Additionally, the last column presents qualitative results on a challenging ground-aerial pair from the WxBS dataset, which involves significant viewpoint change.
  • ...and 4 more figures