DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery

Jiadong Tang, Yu Gao, Dianyi Yang, Liqi Yan, Yufeng Yue, Yi Yang

TL;DR

DroneSplat addresses robust 3D reconstruction from in-the-wild drone imagery by tackling dynamic distractors and limited-view constraints. It combines Adaptive Local-Global Masking to detect and suppress dynamic regions with voxel-guided optimization that leverages dense priors from multi-view stereo (via DUSt3R) to constrain Gaussian primitives. The method integrates geometric priors, per-voxel constraints, and segmentation-guided masking to achieve high-quality static reconstructions in challenging wild scenes, outperforming NeRF-based and 3DGS baselines across dynamic and sparse-view scenarios. A drone-captured 24-sequence dataset further demonstrates the approach's practical value for real-world aerial reconstruction tasks, with implications for urban surveying and cultural heritage preservation.

Abstract

Drones have become essential tools for reconstructing wild scenes due to their outstanding maneuverability. Recent advances in radiance field methods have achieved remarkable rendering quality, providing a new avenue for 3D reconstruction from drone imagery. However, dynamic distractors in wild environments challenge the static scene assumption in radiance fields, while limited view constraints hinder the accurate capture of underlying scene geometry. To address these challenges, we introduce DroneSplat, a novel framework designed for robust 3D reconstruction from in-the-wild drone imagery. Our method adaptively adjusts masking thresholds by integrating local-global segmentation heuristics with statistical approaches, enabling precise identification and elimination of dynamic distractors in static scenes. We enhance 3D Gaussian Splatting with multi-view stereo predictions and a voxel-guided optimization strategy, supporting high-quality rendering under limited view constraints. For comprehensive evaluation, we provide a drone-captured 3D reconstruction dataset encompassing both dynamic and static scenes. Extensive experiments demonstrate that DroneSplat outperforms both 3DGS and NeRF baselines in handling in-the-wild drone imagery.

Paper Structure

This paper contains 24 sections, 16 equations, 20 figures, 3 tables.

Figures (20)

  • Figure 1: Given a set of drone imagery, our method effectively eliminates the impact of dynamic distractors on the static scenes (e.g., vehicles driving on the road). The right side of the figure shows the rendering results of our method compared to 3DGS under input view and novel view. Our method eliminates distractors while 3DGS generates artifacts in the corresponding regions. Moreover, our method reconstructs scenes with accurate geometry under limited viewpoints, demonstrating robustness to significant view variation (novel view).
  • Figure 2: The challenges within in-the-wild drone imagery. Certain regions within the entire scene will face the challenge of dynamic distractors and limited view constraints.
  • Figure 3: The framework of DroneSplat. Given a few posed drone images of a wild scene, our goal is to identify and eliminate dynamic distractors. We first predict a dense point cloud through a learning-based multi-view stereo method, followed by point sampling based on confidence and geometric features. The sampled point cloud is used to initialize Gaussian primitives, which are then optimized using a voxel-guided strategy. At iteration $t=n$, we calculate the normalized residual of the rendered image and combine it with the segmentation results to obtain the object-wise residuals. We adaptively adjust the threshold based on the object-wise residuals and statistical approaches to obtain local masks. Meanwhile, we mark objects with high residuals as tracking candidates, deriving the global set at $t=n$ by combining the global set at $t=n-1$ with the tracking outcomes at $t=n$. After merging the local mask and the global mask retrieved from the global set, we obtain the final mask at $t=n$. The mask set illustrates the predicted dynamic distractors.
  • Figure 4: The effect of Adaptive Local Masking. (a) shows renderings of the same frame at different iterations $t$, and (b) shows the corresponding object-wise residuals. (c) shows the masks obtained using a hard threshold, while (d) shows the masks obtained by Adaptive Local Masking.
  • Figure 5: The effect of Complement Global Masking. At $t=n$, the white car waiting at a red light is not identified by Adaptive Local Masking, but it is tracked through Complement Global Masking in other frames.
  • ...and 15 more figures
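
The masking steps described in the Figure 3 caption (normalized residual, object-wise residuals, adaptive threshold, local mask, merge with the global mask) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the `k` parameter, and the thresholding rule (mean plus `k` standard deviations of the object-wise residuals) are assumptions chosen for illustration; the captions only say that the threshold is adapted using object-wise residuals and statistical approaches. Segmentation and tracking are treated as given inputs here.

```python
import numpy as np

def object_wise_residuals(residual, segments):
    """Average the per-pixel residual inside each segment id."""
    return {int(i): float(residual[segments == i].mean())
            for i in np.unique(segments)}

def adaptive_local_mask(rendered, target, segments, k=1.0):
    """Flag segments whose mean residual is unusually high for this frame.

    rendered, target: float arrays of shape (H, W, 3)
    segments: int array of shape (H, W) holding one object id per pixel
    k: hypothetical sensitivity parameter (an assumption, not from the paper)
    """
    # Per-pixel residual, normalized to [0, 1] within the frame.
    residual = np.abs(rendered - target).mean(axis=-1)
    span = residual.max() - residual.min()
    if span > 0:
        residual = (residual - residual.min()) / span
    else:
        residual = np.zeros_like(residual)

    obj_res = object_wise_residuals(residual, segments)
    values = np.array(list(obj_res.values()))

    # ASSUMPTION: a simple statistical rule standing in for the paper's
    # adaptive threshold. The threshold moves with the residual statistics
    # of the current iteration instead of being a fixed (hard) value.
    threshold = values.mean() + k * values.std()

    flagged = {i for i, r in obj_res.items() if r > threshold}
    mask = np.isin(segments, list(flagged))
    return mask, flagged

def update_global_set(previous_global_set, tracked_ids):
    """Global set at t=n: the set at t=n-1 combined with new tracking results."""
    return set(previous_global_set) | set(tracked_ids)

def merge_masks(local_mask, global_mask):
    """Final mask: a pixel is masked if either mask marks it."""
    return local_mask | global_mask
```

For example, with a 4x4 frame split into four 2x2 segments where only segment 3 is rendered badly, `adaptive_local_mask` flags that segment alone, and `merge_masks` then adds any pixels carried over from the global mask.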