Imbalance-aware Presence-only Loss Function for Species Distribution Modeling
Robin Zbinden, Nina van Tiel, Marc Rußwurm, Devis Tuia
TL;DR
This work tackles learning species distributions from presence-only citizen-science data under severe long-tail imbalance. It evaluates imbalance-aware losses, notably a full-weighted loss with species weights $w_s = \frac{n}{n_{\text{p}(s)}} = \frac{1}{\text{freq}(s)}$ and a PA balance parameter $\lambda_2$, on GeoLifeCLEF 2023 and iNaturalist with the S&T, IUCN, and Geo Prior tasks. The full-weighted loss with $\lambda_2=0.5$ yields the strongest rare-species performance and improves overall AUC in several tasks, which matters most for the rare species that conservation efforts target. These findings highlight the importance of balancing training signals for long-tail presence-only data and suggest a practical default for imbalance-aware learning in large-scale SDMs.
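To make the species weight $w_s = \frac{n}{n_{\text{p}(s)}}$ concrete, here is a minimal sketch in Python. It assumes that each training sample is one presence record labelled with a species id, so that $n$ is the total number of records and $n_{\text{p}(s)}$ is the number of records for species $s$. The function name `species_weights` and the toy data are hypothetical and not taken from the paper's code.

```python
from collections import Counter


def species_weights(species_ids):
    """Return w_s = n / n_p(s) for every species present in species_ids.

    Assumption: each entry of species_ids is one presence record, so
    n is the number of records and n_p(s) is the count for species s.
    """
    counts = Counter(species_ids)   # n_p(s) per species
    n = sum(counts.values())        # total number of presence records
    # Inverse frequency: rare species receive proportionally larger weights.
    return {s: n / c for s, c in counts.items()}


# Toy long-tailed data: species 0 is common, species 2 is observed once.
observations = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = species_weights(observations)
print(weights[2])  # 10.0, since 10 records / 1 presence
```

In this toy case the single-observation species gets a weight six times larger than the common one, which is the rebalancing effect the weighted loss relies on. Species with no presence records are simply absent from the returned dictionary, since their weight would be undefined.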
Abstract
In the face of significant biodiversity decline, species distribution models (SDMs) are essential for understanding the impact of climate change on species habitats by connecting environmental conditions to species occurrences. Traditionally limited by a scarcity of species observations, these models have significantly improved in performance through the integration of larger datasets provided by citizen science initiatives. However, they still suffer from the strong class imbalance between species within these datasets, which often penalizes rare species, the ones most critical for conservation efforts. To tackle this issue, this study assesses the effectiveness of training deep learning models using a balanced presence-only loss function on large citizen science-based datasets. We demonstrate that this imbalance-aware loss function outperforms traditional loss functions across various datasets and tasks, particularly in accurately modeling rare species with limited observations.
