DivShift: Exploring Domain-Specific Distribution Shifts in Large-Scale, Volunteer-Collected Biodiversity Datasets

Elena Sierra; Lauren E. Gillespie; Salim Soltani; Moises Exposito-Alonso; Teja Kattenborn

TL;DR

DivShift examines how domain-specific biases in volunteer-collected biodiversity data affect fine-grained species recognition. The authors formalize a framework that partitions data by known biases and uses Jensen-Shannon Distance to relate label distribution shifts between partitions to changes in model performance, which makes the impact of each bias quantifiable. They introduce DivShift-NAWC, a dataset of almost 7.5 million iNaturalist images from the North American West Coast with five bias partitions for studying these effects in a controlled setting, and report experiments across architectures. The findings show that these biases typically change performance less than the underlying label distribution shift would predict, although the size of the effect depends on the bias. This informs when and how volunteer data can be used for biodiversity monitoring tasks and which data collection efforts to prioritize.
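To make the Jensen-Shannon Distance step concrete, the following is a minimal sketch of how the label distribution shift between two partitions could be measured. It is not the authors' implementation: the function names, the toy species labels, and the choice of base-2 logarithms (which bounds the distance to the range 0 to 1) are all assumptions made for illustration.

```python
import math
from collections import Counter


def label_distribution(labels, classes):
    """Return the empirical probability of each class, in the order given by `classes`."""
    counts = Counter(labels)
    total = sum(counts[c] for c in classes)
    if total == 0:
        raise ValueError("no labels from `classes` were found")
    return [counts[c] / total for c in classes]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two distributions."""
    # The mixture m is positive wherever p or q is, so the KL terms are well defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding error


# Hypothetical species labels for two partitions of one bias (toy data, not from the paper).
classes = ["poppy", "lupine", "sagebrush"]
partition_a = ["poppy", "poppy", "lupine", "sagebrush"]
partition_b = ["lupine", "lupine", "lupine", "sagebrush"]

p = label_distribution(partition_a, classes)
q = label_distribution(partition_b, classes)
print(jensen_shannon_distance(p, q))
```

A distance of 0 means the two partitions have identical label distributions, and 1 means they share no species. Under this sketch's assumptions, comparing that distance with the measured accuracy gap between partitions is one way to ask whether a bias affects performance more or less than label shift alone would suggest.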

Abstract

Large-scale, volunteer-collected datasets of community-identified natural world imagery like iNaturalist have enabled marked performance gains for fine-grained visual classification of species using machine learning methods. However, such data, sometimes referred to as citizen science data, are opportunistic and lack a structured sampling strategy. This volunteer-collected biodiversity data contains geographic, temporal, taxonomic, observer, and sociopolitical biases that can have significant effects on biodiversity model performance, but whose impacts are unclear for fine-grained species recognition performance. Here we introduce Diversity Shift (DivShift), a framework for quantifying the effects of domain-specific distribution shifts on machine learning model performance. To diagnose the performance effects of biases specific to volunteer-collected biodiversity data, we also introduce DivShift - North American West Coast (DivShift-NAWC), a curated dataset of almost 7.5 million iNaturalist images across the western coast of North America partitioned across five types of expert-verified bias. We compare species recognition performance across these bias partitions using a variety of species- and ecosystem-focused accuracy metrics. We observe that these biases confound model performance less than expected from the underlying label distribution shift, and that more data leads to better model performance but the magnitude of these improvements is bias-specific. These findings imply that while the structure within natural world images provides generalization improvements for biodiversity monitoring tasks, the biases present in volunteer-collected biodiversity data can also affect model performance; thus these models should be used with caution in downstream biodiversity monitoring tasks.
