Imbalance-aware Presence-only Loss Function for Species Distribution Modeling

Robin Zbinden, Nina van Tiel, Marc Rußwurm, Devis Tuia

TL;DR

This work tackles learning from presence-only citizen-science data for species distributions under severe long-tail imbalance. It evaluates imbalance-aware losses, notably a full-weighted loss with species weights $w_s = \frac{n}{n_{\text{p}(s)}} = \frac{1}{\text{freq}(s)}$ and a PA balance parameter $\lambda_2$, across GeoLifeCLEF 2023 and iNaturalist with the S&T, IUCN, and Geo Prior tasks. The results show that the full-weighted loss with $\lambda_2=0.5$ yields the strongest rare-species performance and improves overall AUC in several tasks, indicating substantial benefits for conservation-relevant rare species. These findings highlight the importance of balancing training signals for long-tail presence-only data and suggest a practical default for imbalance-aware learning in large-scale SDMs.
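The species weights $w_s = n / n_{\text{p}}(s) = 1/\text{freq}(s)$ mentioned above can be illustrated with a minimal sketch. It assumes that $n$ is the total number of presence records in the training set and that $n_{\text{p}}(s)$ is the number of those records belonging to species $s$. The function name `species_weights` and the toy data are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def species_weights(presence_labels):
    """Return w_s = n / n_p(s) for every species in presence_labels.

    Assumption: n is the total number of presence records, and n_p(s)
    is the number of records for species s, so w_s = 1 / freq(s).
    """
    counts = Counter(presence_labels)   # n_p(s) for each species
    n = sum(counts.values())            # total number of presence records
    return {species: n / n_p for species, n_p in counts.items()}


# Toy long-tailed sample: one frequent species and one rare species.
records = ["common"] * 8 + ["rare"] * 2
weights = species_weights(records)
print(weights)
```

In this toy sample the rare species receives a larger weight than the common one, which is the mechanism by which inverse-frequency weighting increases the contribution of rare species to the training signal.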

Abstract

In the face of significant biodiversity decline, species distribution models (SDMs) are essential for understanding the impact of climate change on species habitats by connecting environmental conditions to species occurrences. Traditionally limited by a scarcity of species observations, these models have significantly improved in performance through the integration of larger datasets provided by citizen science initiatives. However, they still suffer from the strong class imbalance between species within these datasets, often resulting in the penalization of rare species, those most critical for conservation efforts. To tackle this issue, this study assesses the effectiveness of training deep learning models using a balanced presence-only loss function on large citizen science-based datasets. We demonstrate that this imbalance-aware loss function outperforms traditional loss functions across various datasets and tasks, particularly in accurately modeling rare species with limited observations.

Paper Structure (9 sections, 3 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Distributions of the number of presence records in the GeoLifeCLEF 2023 (left) and iNaturalist (right) training datasets, obtained through citizen science initiatives. Both distributions exhibit a long-tailed pattern, which is crucial to address to avoid penalizing rare species during training.
  • Figure 2: Performance of the loss functions, grouped by the number of presence records of species in the training set. The $\mathcal{L}_{\text{full-weighted}}$ loss, defined here with $\lambda_2 = 0.5$, is beneficial for rare species.
  • Figure 3: Distribution of the number of training presences of the species considered in the different tasks. The GLC23 training set contains the same species used in testing.
  • Figure 4: Performance of the loss functions on the S&T dataset, grouped by the number of presence records of species in the training set.