Classifying geospatial objects from multiview aerial imagery using semantic meshes

David Russell, Ben Weinstein, David Wettergreen, Derek Young

TL;DR

This paper tackles the limitations of orthomosaic-based analyses in geospatial mapping by introducing a multiview semantic mesh approach that operates on raw drone imagery from multiple viewpoints. It presents a complete workflow for building 3D semantic meshes, training with per-face labels, and projecting predictions back to geospatial coordinates, along with an open-source toolkit. A field-validated four-site forest dataset is released to support multiview tree species classification, and cross-site experiments show that low-oblique multiview imagery (MV-LO) yields the best performance, achieving 75% accuracy versus 53% for the orthomosaic baseline. The findings demonstrate the practical value of multiview information and oblique perspectives for robust, scalable geospatial prediction in forest ecosystems, with implications for ecological mapping and resource management.

Abstract

Aerial imagery is increasingly used in Earth science and natural resource management as a complement to labor-intensive ground-based surveys. Aerial systems can collect overlapping images that provide multiple views of each location from different perspectives. However, most prediction approaches (e.g. for tree species classification) use a single, synthesized top-down "orthomosaic" image as input that contains little to no information about the vertical aspects of objects and may include processing artifacts. We propose an alternate approach that generates predictions directly on the raw images and accurately maps these predictions into geospatial coordinates using semantic meshes. This method, released as a user-friendly open-source toolkit, enables analysts to use the highest quality data for predictions, capture information about the sides of objects, and leverage multiple viewpoints of each location for added robustness. We demonstrate the value of this approach on a new benchmark dataset of four forest sites in the western U.S. that consists of drone images, photogrammetry results, predicted tree locations, and species classification data derived from manual surveys. We show that our proposed multiview method improves classification accuracy from 53% to 75% relative to an orthomosaic baseline on a challenging cross-site tree species classification task.

Paper Structure

This paper contains 27 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The same tree in (a) an orthomosaic created from nadir drone images collected from 120 m above ground level, (b) a raw nadir drone image collected from 120 m above ground level, and (c) a raw oblique drone image collected from 80 m above ground level. Distortions due to orthorectification are apparent in (a).
  • Figure 2: Schematic of the analytical workflow, with input data (c, d, j, l) outlined in green and output data (i) outlined in blue. At the sites used to train the computer vision model, raw drone images (d) are collected and processed using photogrammetry to estimate camera poses and a 3D geospatial mesh model (a), which is then textured using geospatial ground-truth species labels (c) obtained via field surveys. The semantic mesh tool presented here is then used to render species labels (e) that match pixel-for-pixel with the raw drone images. The raw images and labels are then used to train a computer vision semantic segmentation model (f). To classify trees to species at a new site, another set of drone images (j) is collected and segmented pixel-for-pixel into species classes (g) using the trained model. In parallel, the images are processed using photogrammetry to yield camera poses and a 3D mesh model (k). The semantic mesh tool presented here is then used to project the image-based species classes onto the mesh faces (h). The geospatial locations of trees (l) are determined, and the mesh-based species classes are transferred to the geospatial tree map to yield the final output, a tree map with inferred species labels (i).
  • Figure 3: Confusion matrices for the cross-site tree species classification task, with species counts summed across the four sites and three trials.
  • Figure 4: Labels rendered during the training process. The left pane shows the label IDs colored by class, the middle shows the raw images, and the right shows the classes overlaid on a grayscale image.
  • Figure 5: Site-level confusion matrices for the leave-one-site-out tree species classification task, summed over all trials, for the orthomosaic (ortho.) dataset (a-d), high-nadir multiview (MV-HN) dataset (e-h), and low-oblique multiview (MV-LO) dataset (i-l).
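
The inference half of the workflow in Figure 2 (per-image class predictions projected onto mesh faces, then transferred to mapped trees) can be illustrated with a small voting sketch. This is not the toolkit's API: the function names, the input layout (a per-pixel face-index image for each view, with -1 marking pixels that see no face), and the simple majority-vote rule are all assumptions made for illustration, and the aggregation used in the paper may differ.

```python
import numpy as np


def aggregate_face_votes(face_ids_per_view, class_ids_per_view, n_faces, n_classes):
    """Count, for every mesh face, how often each class was predicted across views.

    Hypothetical inputs: for each view, an integer image giving the mesh face
    seen at each pixel (-1 where no face is visible) and a same-shaped image
    of predicted class IDs.
    """
    votes = np.zeros((n_faces, n_classes), dtype=np.int64)
    for face_ids, class_ids in zip(face_ids_per_view, class_ids_per_view):
        face_ids = np.asarray(face_ids).ravel()
        class_ids = np.asarray(class_ids).ravel()
        visible = face_ids >= 0
        # np.add.at accumulates correctly when the same face appears many times.
        np.add.at(votes, (face_ids[visible], class_ids[visible]), 1)
    return votes


def classify_trees(votes, face_to_tree, n_trees):
    """Sum face votes per tree and take the most-voted class (-1 if no votes)."""
    face_to_tree = np.asarray(face_to_tree)
    tree_votes = np.zeros((n_trees, votes.shape[1]), dtype=np.int64)
    assigned = face_to_tree >= 0  # -1 marks faces outside every tree crown
    np.add.at(tree_votes, face_to_tree[assigned], votes[assigned])
    labels = tree_votes.argmax(axis=1)
    labels[tree_votes.sum(axis=1) == 0] = -1
    return labels


# Toy example: two 2x2 views, four mesh faces, three classes, three trees.
face_ids_per_view = [np.array([[0, 0], [1, -1]]), np.array([[0, 1], [1, 2]])]
class_ids_per_view = [np.array([[2, 2], [0, 1]]), np.array([[2, 1], [1, 1]])]
votes = aggregate_face_votes(face_ids_per_view, class_ids_per_view, n_faces=4, n_classes=3)
tree_labels = classify_trees(votes, face_to_tree=[0, 1, 1, -1], n_trees=3)
```

In this toy case face 1 receives one vote for class 0 from the first view and two votes for class 1 from the second, so the second view outvotes the first. This is the kind of robustness that multiple viewpoints are meant to provide.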