Habitat Classification from Ground-Level Imagery Using Deep Neural Networks

Hongrui Shi, Lisa Norton, Lucy Ridding, Simon Rolph, Tom August, Claire M Wood, Lan Qie, Petra Bosilj, James M Brown

Abstract

Habitat assessment at local scales, which is critical for enhancing biodiversity and guiding conservation priorities, often relies on expert field surveys that can be costly, motivating the exploration of AI-driven tools to automate and refine this process. Most AI-driven habitat mapping depends on remote sensing, which is often constrained by sensor availability, weather, and coarse resolution. In contrast, ground-level imagery captures essential structural and compositional cues invisible from above, yet remains underexplored for robust, fine-grained habitat classification. This study addresses this gap by applying state-of-the-art deep neural network architectures to ground-level habitat imagery. Leveraging data from the UK Countryside Survey covering 18 broad habitat types, we evaluate two families of models, convolutional neural networks (CNNs) and vision transformers (ViTs), under both supervised and supervised contrastive learning paradigms. Our results demonstrate that ViTs consistently outperform state-of-the-art CNN baselines on key classification metrics (Top-3 accuracy = 91%, MCC = 0.66) and offer more interpretable scene understanding tailored to ground-level images. Moreover, supervised contrastive learning significantly reduces misclassification rates among visually similar habitats (e.g., Improved vs. Neutral Grassland), driven by a more discriminative embedding space. Finally, our best model performs on par with experienced ecological experts in habitat classification from images, underscoring the promise of expert-level automated assessment. By integrating advanced AI with ecological expertise, this research establishes a scalable, cost-effective framework for ground-level habitat monitoring to accelerate biodiversity conservation and inform land-use decisions at a national scale.
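The two headline metrics in the abstract, Top-3 accuracy and the Matthews correlation coefficient (MCC), can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `top_k_accuracy` and `multiclass_mcc` and the toy scores and labels are hypothetical, and the MCC uses the standard multiclass generalisation, which is assumed to be the variant reported.

```python
import math
from collections import Counter


def top_k_accuracy(scores, labels, k=3):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, y in zip(scores, labels):
        # Indices of the k classes with the largest scores for this sample.
        top = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += y in top
    return hits / len(labels)


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC computed from label counts (assumed variant, see lead-in)."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # how often each class truly occurs
    pred_counts = Counter(y_pred)   # how often each class is predicted
    classes = set(true_counts) | set(pred_counts)
    sum_tp = sum(true_counts[c] * pred_counts[c] for c in classes)
    sum_tt = sum(v * v for v in true_counts.values())
    sum_pp = sum(v * v for v in pred_counts.values())
    denom = math.sqrt((n * n - sum_pp) * (n * n - sum_tt))
    # Convention: return 0 when the denominator vanishes (e.g. a single class).
    return 0.0 if denom == 0 else (correct * n - sum_tp) / denom


# Toy example with 4 hypothetical classes (not data from the paper).
toy_scores = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
toy_labels = [0, 0]
toy_top3 = top_k_accuracy(toy_scores, toy_labels, k=3)
toy_mcc = multiclass_mcc([0, 1, 2, 2], [0, 1, 2, 1])
```

Unlike plain accuracy, MCC accounts for how predictions are distributed across classes, which matters when habitat classes are imbalanced; Top-3 accuracy instead credits the model whenever the true habitat appears among its three most confident guesses.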

Paper Structure

This paper contains 25 sections, 7 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Examples of level 3 (L3) habitats defined by UKHab in the Countryside Survey dataset, grouped by their coarse L2 categories (bold text). Some L2 habitats, such as Cropland, only have one L3 class. Note that Improved Grassland, Montane, and Bracken have different origins than L3 habitats in UKHab but are treated as L3 habitats in the CS dataset. An explanation is provided in the data part of the Methods section.
  • Figure 1: UMAPs of the embedding space produced by the image encoder on the test set. SupCon is found to generate tight clusters that are more separable from each other compared to supervised learning. Habitat symbols in this figure: $\bullet$ Acid Grassland; $\bullet$ Arable and Horticulture; $\bullet$ Bog; $\bullet$ Bracken; $\bullet$ Broadleaved Mixed and Yew Woodland; $\bullet$ Calcareous Grassland; $\bullet$ Coniferous Woodland; $\bullet$ Dwarf Shrub Heath; $\bullet$ Fen, Marsh, Swamp; $\bullet$ Improved Grassland; $\bullet$ Inland Rock; $\bullet$ Littoral Sediment; $\bullet$ Montane; $\bullet$ Neutral Grassland; $\bullet$ Supra-littoral Sediment; $\bullet$ Urban.
  • Figure 2: Habitat distributions (L2 and L3) in the CS dataset based on the UKHab system.
  • Figure 2: Delta confusion matrix (CM) of WRN-50-2 with SupCon: based on the CM produced by SupCon.
  • Figure 3: The conceptual difference between how CNN and ViT process the same ground-level habitat image. A ground-level photograph of a mixed woodland scene is partitioned into a conceptual 3×3 grid of patches. In the top row, the CNN extracts visual features using a sliding filter that only sees one patch at a time, so it tends to focus on prominent features in local regions. This locality is reflected in the GradCAM heatmap, where the model concentrates on the trees in the central region (heated area). In contrast, the ViT (bottom row) allows each patch to “attend” to every other patch through self-attention links. The lines in the diagram are a schematic way to show this idea: thicker lines simply indicate stronger relationships between patches. The ViT therefore considers the global context of the scene, linking near and far regions. The GradCAM heatmap of the ViT highlights areas not only on the central trees but also along the edges, reflecting a broader contextual understanding. This behaviour aligns more closely with human interpretation of ground-level habitat photos, where habitat context often spans the entire scene rather than a single region.
  • ...and 10 more figures
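The patch-to-patch "attending" described in the Figure 3 caption can be sketched as single-head scaled dot-product self-attention over a 3×3 grid of patches. This is a simplified illustration, not the paper's model: it assumes no learned query/key/value projections (each patch embedding serves as all three), and the function name `self_attention` and the 2-dimensional patch embeddings are made up for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(patches):
    """Single-head self-attention with Q = K = V = patches (simplifying assumption)."""
    d = len(patches[0])
    weights = []
    for q in patches:
        # Similarity of this patch to every patch, scaled by sqrt(embedding size).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in patches]
        weights.append(softmax(scores))
    # Each output is a weighted mix of all patch embeddings (global context).
    outputs = [
        [sum(w * v[j] for w, v in zip(row, patches)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs


# Hypothetical embeddings for the 9 patches of a 3x3 grid.
patches = [[math.cos(i), math.sin(i)] for i in range(9)]
weights, outputs = self_attention(patches)
```

Row `i` of `weights` plays the role of the lines leaving patch `i` in the diagram: a larger weight corresponds to a thicker line. Every entry is strictly positive, so each patch draws on all nine patches, whereas a convolutional filter only combines values inside its local window.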