From Flight to Insight: Semantic 3D Reconstruction for Aerial Inspection via Gaussian Splatting and Language-Guided Segmentation

Mahmoud Chick Zaouali, Todd Charter, Homayoun Najjaran

TL;DR

This work addresses the lack of semantic interpretability in fast 3D reconstructions for aerial inspection by integrating language-guided segmentation with 3D Gaussian Splatting. It extends Feature-3DGS with CLIP-LSeg heatmaps and SAM/SAM2 refinement to enable open-vocabulary semantic querying on outdoor UAV scenes, and compares backbones across two real datasets. The findings show CLIP-LSeg provides coherent semantic grouping while SAM/SAM2 offer varying degrees of localization and speed, enabling a practical, two-stage segmentation pipeline. The approach demonstrates feasibility for interactive semantic analysis of large-scale outdoor environments, with future work aimed at domain-specific tuning, true 3D feature fields, and onboard deployment for real-time guidance.

Abstract

High-fidelity 3D reconstruction is critical for aerial inspection tasks such as infrastructure monitoring, structural assessment, and environmental surveying. While traditional photogrammetry techniques enable geometric modeling, they lack semantic interpretability, limiting their effectiveness for automated inspection workflows. Recent advances in neural rendering and 3D Gaussian Splatting (3DGS) offer efficient, photorealistic reconstructions but similarly lack scene-level understanding. In this work, we present a UAV-based pipeline that extends Feature-3DGS for language-guided 3D segmentation. We leverage LSeg-based feature fields with CLIP embeddings to generate heatmaps in response to language prompts. These are thresholded to produce rough segmentations, and the highest-scoring point is then used as a prompt to SAM or SAM2 for refined 2D segmentation on novel view renderings. Our results highlight the strengths and limitations of various feature field backbones (CLIP-LSeg, SAM, SAM2) in capturing meaningful structure in large-scale outdoor environments. We demonstrate that this hybrid approach enables flexible, language-driven interaction with photorealistic 3D reconstructions, opening new possibilities for semantic aerial inspection and scene understanding.
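The two-stage idea in the abstract (language heatmap, threshold, then a point prompt for SAM or SAM2) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the heatmap is a per-pixel cosine similarity between a rendered feature map and a text embedding, uses tiny synthetic arrays in place of real CLIP-LSeg features, and stops short of calling any segmentation model. The function names `relevance_heatmap` and `coarse_mask_and_prompt`, and the threshold value, are hypothetical.

```python
import numpy as np

def relevance_heatmap(features, text_embedding):
    """Per-pixel cosine similarity between features (H, W, D) and a text embedding (D,).

    Assumption: cosine similarity stands in for the language-relevance score.
    """
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-8)
    t = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    return f @ t  # shape (H, W)

def coarse_mask_and_prompt(heatmap, threshold):
    """Threshold the heatmap into a rough mask and pick the highest-scoring pixel."""
    mask = heatmap >= threshold
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # Return the point as (x, y), i.e. (column, row), for use as a point prompt.
    return mask, (int(col), int(row))

# Synthetic stand-ins for a rendered feature map and an encoded text prompt.
text_embedding = np.array([1.0, 0.0, 0.0, 0.0])
features = np.zeros((4, 6, 4))
features[..., 1] = 1.0                          # background: orthogonal to the prompt
features[1:3, 2:4] = [1.0, 0.2, 0.0, 0.0]       # region that roughly matches the prompt
features[2, 3] = [1.0, 0.0, 0.0, 0.0]           # single best-matching pixel

heatmap = relevance_heatmap(features, text_embedding)
mask, point_xy = coarse_mask_and_prompt(heatmap, threshold=0.9)

print("coarse mask pixels:", int(mask.sum()))
print("point prompt (x, y):", point_xy)
```

In a full pipeline, `point_xy` would be handed to a point-promptable segmenter such as SAM or SAM2 running on the same novel-view rendering, which would return the refined 2D mask. The coarse mask is kept because it remains useful on its own as a semantic grouping.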

Paper Structure

This paper contains 19 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: End-to-End Pipeline for Language-Guided 3D Reconstruction and Semantic Feature Field Distillation
  • Figure 2: Renderings of the 3D feature fields for the Building dataset
  • Figure 3: Renderings of the 3D feature fields for the Observatory dataset
  • Figure 4: LSeg thresholded segmentation with the prompt "stairs with metal railing"
  • Figure 5: Comparison of the point-prompted SAM segmentations.
  • ...and 1 more figure