Spatial Clustering of Citizen Science Data Improves Downstream Species Distribution Models

Nahian Ahmed, Mark Roth, Tyler A. Hallman, W. Douglas Robinson, Rebecca A. Hutchinson

TL;DR

This paper tackles imperfect detection in citizen-science biodiversity data by evaluating how post hoc site clustering influences occupancy-based species distribution models (SDMs). It compares ten clustering approaches, including ML spatial methods like clustGeo and DBSC, and uses BayesOpt to tune clustering parameters, validating on eBird data for 31 Oregon bird species. The study finds that clustering methods that preserve all observations and incorporate environmental similarity generally outperform purely geographic baselines, with best-clustGeo achieving the highest mean predictive performance when tuned to each species. These findings offer practical guidance for constructing observational units from opportunistic data and highlight the value of integrating environmental features and automated parameter tuning to improve downstream SDMs in conservation and management contexts.

Abstract

Citizen science biodiversity data present great opportunities for ecology and conservation across vast spatial and temporal scales. However, the opportunistic nature of these data lacks the sampling structure required by modeling methodologies that address a pervasive challenge in ecological data collection: imperfect detection, i.e., the likelihood of under-observing species on field surveys. Occupancy modeling is an example of an approach that accounts for imperfect detection by explicitly modeling the observation process separately from the biological process of habitat selection. This produces species distribution models that speak to the pattern of the species on a landscape after accounting for imperfect detection in the data, rather than the pattern of species observations corrupted by errors. To achieve this benefit, occupancy models require multiple surveys of a site across which the site's status (i.e., occupied or not) is assumed constant. Since citizen science data are not collected under the required repeated-visit protocol, observations may be grouped into sites post hoc. Existing approaches for constructing sites discard some observations and/or consider only geographic distance and not environmental similarity. In this study, we compare ten approaches for site construction in terms of their impact on downstream species distribution models for 31 bird species in Oregon, using observations recorded in the eBird database. We find that occupancy models built on sites constructed by spatial clustering algorithms perform better than existing alternatives.

Paper Structure

This paper contains 18 sections, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Graphical representation of the occupancy model. The latent variable $Z_i \in \{0,1\}$ represents occupancy at site $i=1,...,M$, and $Y_{it} \in \{0,1\}$ represents the observation on survey $t = 1,...,T_i$. $X_i$ represents site features and $W_{it}$ represents survey features.
  • Figure 2: Simulated example of site formation using the clustGeo and DBSC clustering algorithms. eBird observation locations from southwest Oregon, United States, are shown as red dots overlaid on satellite imagery of the corresponding region. clustGeo aggregates points iteratively and stops when the desired number of clusters is reached. Newly created clusters at each step are shown using bold dashed circles and ellipses. DBSC constructs a Delaunay triangulation (shown using orange triangles) and then splits it based on spatial constraints and feature similarity.
  • Figure 3: Observation locations from eBird checklists in 2017 and 2018 recorded over southwest Oregon, United States, are shown as red dots. Of these, there were 2,497 checklists at 1,314 unique locations in 2017, and 3,490 checklists at 1,519 unique locations in 2018.
  • Figure 4: Boxplots show the percentage AUC improvement of each method over lat-long. Larger positive values indicate better performance than lat-long; negative values indicate worse performance than lat-long.
  • Figure 5: Occupancy probability of Northern Flicker (Colaptes auratus) over southwestern Oregon, United States, as predicted by species distribution models built from sites produced by ten clustering algorithms.
  • ...and 10 more figures
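
To make the structure in Figure 1 concrete, the following is a minimal sketch of how a single site's detection history contributes to the likelihood once the latent occupancy state $Z_i$ is summed out. It is not the authors' implementation. It assumes the standard single-season formulation with no false positives (an unoccupied site can only yield non-detections) and logistic links from $X_i$ and $W_{it}$ to the occupancy and detection probabilities. The function names and the coefficient vectors `beta` and `alpha` are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_prob(features, coefs):
    """Probability from a logistic link on a linear predictor (assumed link)."""
    return sigmoid(sum(f * c for f, c in zip(features, coefs)))


def site_likelihood(y, psi, p):
    """Marginal likelihood of one site's detection history.

    y   : list of 0/1 observations Y_it for surveys t = 1..T_i
    psi : occupancy probability P(Z_i = 1)
    p   : list of per-survey detection probabilities P(Y_it = 1 | Z_i = 1)
    """
    # Probability of the history if the site is occupied (Z_i = 1).
    occupied = 1.0
    for y_t, p_t in zip(y, p):
        occupied *= p_t if y_t == 1 else (1.0 - p_t)
    # If the site is unoccupied (Z_i = 0), only an all-zero history is
    # possible, because no false positives are assumed.
    unoccupied = 0.0 if any(y) else 1.0
    return psi * occupied + (1.0 - psi) * unoccupied


# Hypothetical site with two surveys: one site feature plus an intercept,
# and one survey feature plus an intercept.
beta = [0.4, 1.0]    # hypothetical occupancy coefficients
alpha = [0.0, 0.5]   # hypothetical detection coefficients
x_i = [1.0, 0.2]                 # X_i (first entry is the intercept term)
w_i = [[1.0, -0.3], [1.0, 0.8]]  # W_it for t = 1, 2

psi_i = linear_prob(x_i, beta)
p_i = [linear_prob(w_it, alpha) for w_it in w_i]
print(site_likelihood([0, 1], psi_i, p_i))
```

The sketch shows why site construction matters: every observation grouped into the same site shares one $Z_i$, so a single detection anywhere in the group rules out the unoccupied branch for all surveys in that site.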