DivShift: Exploring Domain-Specific Distribution Shifts in Large-Scale, Volunteer-Collected Biodiversity Datasets

Elena Sierra, Lauren E. Gillespie, Salim Soltani, Moises Exposito-Alonso, Teja Kattenborn

TL;DR

DivShift addresses how domain-specific biases in volunteer-collected biodiversity data affect fine-grained species recognition. The authors formalize a framework that partitions data by known biases and uses Jensen-Shannon Distance to relate label distribution shifts to model performance changes, enabling quantification of bias impact. They introduce DivShift-NAWC, a large North American West Coast dataset with five bias partitions to study these effects in a controlled setting, plus thorough experiments across architectures. The findings show biases often induce smaller performance shifts than label shifts, but effects are bias-dependent, informing when and how to use volunteer data for biodiversity monitoring tasks and guiding data collection priorities.

Abstract

Large-scale, volunteer-collected datasets of community-identified natural world imagery like iNaturalist have enabled marked performance gains for fine-grained visual classification of species using machine learning methods. However, such data -- sometimes referred to as citizen science data -- are opportunistic and lack a structured sampling strategy. This volunteer-collected biodiversity data contains geographic, temporal, taxonomic, observer, and sociopolitical biases that can have significant effects on biodiversity model performance, but whose impacts on fine-grained species recognition performance are unclear. Here we introduce Diversity Shift (DivShift), a framework for quantifying the effects of domain-specific distribution shifts on machine learning model performance. To diagnose the performance effects of biases specific to volunteer-collected biodiversity data, we also introduce DivShift - North American West Coast (DivShift-NAWC), a curated dataset of almost 7.5 million iNaturalist images across the western coast of North America partitioned across five types of expert-verified bias. We compare species recognition performance across these bias partitions using a variety of species- and ecosystem-focused accuracy metrics. We observe that these biases confound model performance less than expected from the underlying label distribution shift, and that more data leads to better model performance, but the magnitude of these improvements is bias-specific. These findings imply that while the structure within natural world images provides generalization improvements for biodiversity monitoring tasks, the biases present in volunteer-collected biodiversity data can also affect model performance; thus these models should be used with caution in downstream biodiversity monitoring tasks.

Paper Structure

This paper contains 41 sections, 1 equation, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Biases present in biodiversity data include (a) spatial bias, (b) temporal bias, (c) taxonomic bias, (d) observer behavior bias, and (e) sociopolitical bias.
  • Figure 2: The Diversity Shift (DivShift) Framework (a) quantifies the impact of domain-specific biases by first splitting data into partitions $P_A$ and $P_B$ according to expert-verified types of bias. Bias impacts are then quantified by measuring the accuracy of models trained on $P_{A train}$ when evaluated on $P_{A test}$ and $P_{B test}$; these accuracies are further compared to (b) the distribution shift from the labels in $P_{A train}$ to the labels in $P_{A test}$ and $P_{B test}$, measured with the Jensen-Shannon Distance (JSD).
  • Figure 3: Overview of DivShift--North American West Coast Dataset (DivShift-NAWC). (a) Density plot of the DivShift-NAWC's iNaturalist observations inat. Observations are skewed to U.S. and coastal states. (b) DivShift-NAWC spans a diverse set of habitats and ecosystems ecoregions, (c) along with climates WorldClim. (d) DivShift-NAWC observations are concentrated in human-modified areas HumanFootprint.
  • Figure 4: Biases in the DivShift-NAWC dataset. (a) Human footprint index HumanFootprint across human-modified and wilderness areas. (b) Observations per day, with City Nature Challenge spike highlighted. (c) Observations per observer with casual/engaged lines highlighted. (d) Density of observations in shared ecoregions across the Arizona-Sonora border.
  • Figure A1: Class-balanced training helps in most cases (All Species), but at the cost of common species performance (By Species Rarity).
  • ...and 5 more figures
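
The Figure 2 caption describes comparing label distributions between partitions with the Jensen-Shannon Distance. The following is a minimal sketch of that comparison, not the authors' implementation: the function names (`label_distribution`, `jensen_shannon_distance`), the toy species labels, and the choice of base-2 logarithms (which bounds the distance to [0, 1]) are all assumptions made for illustration.

```python
import math
from collections import Counter


def label_distribution(labels, classes):
    """Empirical probability of each class in `classes` among `labels`."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logs)."""
    # Mixture distribution M = (P + Q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


# Hypothetical toy labels standing in for P_A train, P_A test, and P_B test.
classes = ["oak", "pine", "sage", "poppy"]
train_a = ["oak"] * 50 + ["pine"] * 30 + ["sage"] * 15 + ["poppy"] * 5
test_a = ["oak"] * 10 + ["pine"] * 6 + ["sage"] * 3 + ["poppy"] * 1
test_b = ["oak"] * 2 + ["pine"] * 3 + ["sage"] * 5 + ["poppy"] * 10

p_train = label_distribution(train_a, classes)
jsd_in_partition = jensen_shannon_distance(
    p_train, label_distribution(test_a, classes)
)
jsd_cross_partition = jensen_shannon_distance(
    p_train, label_distribution(test_b, classes)
)
```

In this toy setup the in-partition test set has the same class proportions as the training set, so its distance is zero, while the cross-partition test set is shifted toward rarer classes and yields a larger distance. Under the framework described in the caption, such distances would then be set against the accuracy of a model trained on the first partition and evaluated on each test set.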