BirdNeRF: Fast Neural Reconstruction of Large-Scale Scenes From Aerial Imagery

Huiqing Zhang, Yifei Xue, Ming Liao, Yizhen Lao

TL;DR

BirdNeRF targets fast, high-fidelity 3D reconstruction of large-scale scenes from aerial imagery. It decomposes the scene into sub-scenes according to the camera distribution and trains each sub-scene independently with Instant-NGP. A projection-guided re-rendering pipeline then fuses the outputs of the relevant sub-models to render novel views, using ground-plane geometry to index and align sub-scenes. This split-unite paradigm yields approximately $10\times$ faster reconstruction than Metashape and more than $50\times$ faster reconstruction than current large-scale NeRF approaches on a single GPU, while preserving rendering quality. The approach is robust across diverse datasets, from urban areas to campuses, and is relevant to rapid urban modeling, disaster response, and planning applications where memory and time constraints are critical.

Abstract

In this study, we introduce BirdNeRF, an adaptation of Neural Radiance Fields (NeRF) designed specifically for reconstructing large-scale scenes from aerial imagery. Unlike previous research focused on small-scale, object-centric NeRF reconstruction, our approach addresses multiple challenges: (1) the slow training and rendering associated with large models; (2) the computational demands of modeling a large number of images, which require extensive resources such as high-performance GPUs; and (3) the significant artifacts and low visual fidelity commonly observed in large-scale reconstruction tasks due to limited model capacity. Specifically, we present a novel bird-view pose-based spatial decomposition algorithm that splits a large aerial image set into multiple small sets with appropriately sized overlaps, allowing us to train an individual NeRF for each sub-scene. This decomposition not only decouples rendering time from scene size but also enables rendering to scale seamlessly to arbitrarily large environments. Moreover, it allows per-block updates of the environment, making the reconstruction process more flexible and adaptable. Additionally, we propose a projection-guided novel view re-rendering strategy, which makes effective use of the independently trained sub-scenes to generate superior rendering results. We evaluate our approach on existing datasets as well as on our own drone footage, improving reconstruction speed by 10x over classical photogrammetry software and by 50x over a state-of-the-art large-scale NeRF solution, on a single GPU and with similar rendering quality.
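
The pose-based decomposition with overlaps can be illustrated with a minimal sketch. It assumes a regular grid over the bird's-eye (x, y) camera positions and a fractional overlap margin; the function name `decompose_cameras`, the grid layout, and the `overlap` parameter are hypothetical choices made for illustration, and the paper's actual clustering rule may differ.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # bird's-eye (x, y) position of one camera


def decompose_cameras(positions: List[Point], nx: int, ny: int,
                      overlap: float = 0.2) -> List[List[int]]:
    """Assign camera indices to nx * ny overlapping ground-plane cells.

    Assumption: a regular grid; each cell is enlarged on every side by
    `overlap` times the cell size, so neighbouring cells share cameras.
    """
    xs = [p[0] for p in positions]
    ys = [p[1] for p in positions]
    x0, y0 = min(xs), min(ys)
    # Cell size; fall back to 1.0 when all cameras share one coordinate.
    w = (max(xs) - x0) / nx or 1.0
    h = (max(ys) - y0) / ny or 1.0

    clusters: List[List[int]] = []
    for j in range(ny):
        for i in range(nx):
            lo_x, hi_x = x0 + (i - overlap) * w, x0 + (i + 1 + overlap) * w
            lo_y, hi_y = y0 + (j - overlap) * h, y0 + (j + 1 + overlap) * h
            # A camera joins every cell whose enlarged bounds contain it.
            clusters.append([k for k, (x, y) in enumerate(positions)
                             if lo_x <= x <= hi_x and lo_y <= y <= hi_y])
    return clusters


cams = [(0.0, 0.0), (4.5, 0.0), (5.5, 0.0), (10.0, 0.0)]
clusters = decompose_cameras(cams, nx=2, ny=1, overlap=0.2)
```

In this toy example the two cameras near the cell boundary land in both clusters, which is the kind of overlap that lets neighbouring sub-scenes be aligned later. Each cluster's images would then be used to train one sub-scene model independently.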

Paper Structure

This paper contains 20 sections, 7 equations, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: Illustration of modular scene training, along with performance and time comparisons on the IZAA dataset (comprising 1469 images). Our method is approximately 10x faster than the traditional Metashape software and approximately 56x faster than current deep-learning-based large-scale reconstruction approaches.
  • Figure 2: The BirdNeRF pipeline begins with the preprocessing phase (\ref{sec:init}), where input images are processed to obtain camera positions and coefficient point clouds. Spatial decomposition (\ref{sec:spa_dcp}) follows, grouping cameras into clusters. For each cluster, the associated images are used for independent training (\ref{sec:train}), creating multiple sub-scenes. The novel projection-guided view re-rendering strategy (\ref{sec:projection}) then synthesizes the final rendered images.
  • Figure 3: Sub-scene extension. Strategically expanding the sub-scenes increases the overlap between them, which raises the success rate of the subsequent image registration in our proposed approach.
  • Figure 4: Projection-guided novel view re-rendering. Beginning with independently constructed input NeRFs, namely Sub-scene A and Sub-scene B, we perform image rendering from novel viewpoints. Then, employing a sequence of image stitching and fusion techniques, we achieve higher-quality re-rendering results.
  • Figure 5: Ground plane fitting and pixel projection.
  • ...and 6 more figures
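
The ground plane fitting and pixel projection named in Figure 5 can be illustrated with a minimal sketch. It assumes the ground is modeled as a plane z = a*x + b*y + c fitted by least squares to 3D points, and that a pixel is projected by intersecting its viewing ray with that plane. The function names `fit_ground_plane` and `project_ray_to_plane` are hypothetical, and the paper's actual fitting procedure is not given in this summary.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


def _det3(m: List[List[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_ground_plane(points: List[Vec3]) -> Vec3:
    """Least-squares fit of z = a*x + b*y + c; returns (a, b, c)."""
    n = float(len(points))
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sz = sum(p[2] for p in points)
    sxx = sum(p[0] * p[0] for p in points)
    syy = sum(p[1] * p[1] for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    sxz = sum(p[0] * p[2] for p in points)
    syz = sum(p[1] * p[2] for p in points)
    # Normal equations A @ (a, b, c) = rhs, solved with Cramer's rule.
    A = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = _det3(A)
    if abs(d) < 1e-12:
        raise ValueError("points are degenerate (e.g. collinear)")
    coeffs = []
    for col in range(3):
        m = [row[:] for row in A]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(_det3(m) / d)
    return coeffs[0], coeffs[1], coeffs[2]


def project_ray_to_plane(origin: Vec3, direction: Vec3,
                         plane: Vec3) -> Optional[Vec3]:
    """Intersect the ray origin + t * direction (t > 0) with the plane."""
    a, b, c = plane
    denom = a * direction[0] + b * direction[1] - direction[2]
    if abs(denom) < 1e-12:
        return None  # ray is parallel to the plane
    t = -(a * origin[0] + b * origin[1] - origin[2] + c) / denom
    if t <= 0:
        return None  # plane lies behind the camera
    return (origin[0] + t * direction[0],
            origin[1] + t * direction[1],
            origin[2] + t * direction[2])


pts = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.5), (0.0, 1.0, 2.0),
       (1.0, 1.0, 2.5), (2.0, 1.0, 3.0)]
plane = fit_ground_plane(pts)
hit = project_ray_to_plane((0.0, 0.0, 10.0), (0.0, 0.0, -1.0), plane)
```

Here `direction` stands for the world-space viewing ray of a pixel, which would normally be derived from the camera intrinsics and pose. The resulting ground point could then serve to look up which sub-scenes cover that pixel.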