On the selection and effectiveness of pseudo-absences for species distribution modeling with deep learning

Robin Zbinden, Nina van Tiel, Benjamin Kellenberger, Lloyd Hughes, Devis Tuia

TL;DR

This paper tackles the challenge of modeling species distributions when only presence data are available by integrating pseudo-absences into multi-species neural networks through a weighted loss function. The authors introduce per-species weights and tunable weights that control the mix of pseudo-absence types, formalized as the full-weighted loss $L_{\text{full-weighted}}$, and set these weights via spatial block cross-validation. Across six regional benchmarks with independent presence-absence test sets, the approach yields higher mean AUC than existing loss functions and baselines such as MaxEnt, while highlighting region- and species-specific effects of pseudo-absence type. The work provides a flexible, scalable framework for SDMs with presence-only data, emphasizing the importance of accounting for sampling bias and class imbalance in multi-species neural models.

Abstract

Species distribution modeling is a highly versatile tool for understanding the intricate relationship between environmental conditions and species occurrences. However, the available data often lack information on confirmed species absence and are limited to opportunistically sampled, presence-only observations. To overcome this limitation, a common approach is to employ pseudo-absences, which are specific geographic locations designated as negative samples. While pseudo-absences are well-established for single-species distribution models, their application in the context of multi-species neural networks remains underexplored. Notably, the significant class imbalance between species presences and pseudo-absences is often left unaddressed. Moreover, the existence of different types of pseudo-absences (e.g., random and target-group background points) adds complexity to the selection process. Determining the optimal combination of pseudo-absence types is difficult and depends on the characteristics of the data, particularly considering that certain types of pseudo-absences can be used to mitigate geographic biases. In this paper, we demonstrate that these challenges can be effectively tackled by integrating pseudo-absences in the training of multi-species neural networks through modifications to the loss function. This adjustment involves assigning different weights to the distinct terms of the loss function, thereby addressing both the class imbalance and the choice of pseudo-absence types. Additionally, we propose a strategy to set these loss weights using spatial block cross-validation with presence-only data. We evaluate our approach using a benchmark dataset containing independent presence-absence data from six different regions and report improved results when compared to competing approaches.
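
As a rough illustration of what "assigning different weights to the distinct terms of the loss function" could look like, the sketch below combines a species weight $w_s$ with a weight $\lambda_2$ that mixes target-group and random background pseudo-absences. It is not the paper's exact formulation: the inverse-frequency form of `species_weights`, the function names, and the way the three terms are combined are assumptions made for illustration. Only the roles of $w_s$ and $\lambda_2$ ($\lambda_2 = 0$ for random background points only, $\lambda_2 = 1$ for target-group background points only) are taken from the figure captions.

```python
import math


def species_weights(presence_counts):
    """Hypothetical inverse-frequency weights: rarer species get larger weights."""
    total = sum(presence_counts)
    return [total / n for n in presence_counts]


def weighted_loss(p_obs, y_obs, p_rand, w, lam2, eps=1e-7):
    """Illustrative weighted binary cross-entropy for one training location.

    p_obs  : predicted probability per species at an observed location
    y_obs  : 0/1 label per species there (1 = species recorded; the 0 entries
             act as target-group pseudo-absences)
    p_rand : predicted probability per species at a random background location
    w      : species weights w_s (assumed form, see species_weights)
    lam2   : mix between target-group (lam2) and random (1 - lam2) pseudo-absences
    """
    def clip(p):
        # Keep probabilities away from 0 and 1 so the logarithms stay finite.
        return min(max(p, eps), 1.0 - eps)

    total = 0.0
    for p, y, pr, ws in zip(p_obs, y_obs, p_rand, w):
        p, pr = clip(p), clip(pr)
        if y == 1:
            total -= ws * math.log(p)                 # weighted presence term
        else:
            total -= lam2 * math.log(1.0 - p)         # target-group pseudo-absence term
        total -= (1.0 - lam2) * math.log(1.0 - pr)    # random pseudo-absence term
    return total / len(w)
```

With `lam2=1.0` the random background term drops out entirely, and with `lam2=0.0` the target-group term does, which mirrors the two extremes described for $\lambda_2$.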

Paper Structure (17 sections, 4 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Modeling species distributions often involves presence-only (PO) data, where information about species absences is unavailable. To apply machine learning techniques in such situations, pseudo-absences are used as a contrast to presence data. There are primarily two types of pseudo-absences: target-group background points and random background points. Target-group background points consist of presences of other species, sharing a similar sampling bias, while random background points are uniformly sampled within the area. In this paper, we emphasize that how these different types are handled during training is critical for performance, especially when dealing with neural networks. We support our approach by evaluating it on an independent test set composed of presence-absence (PA) data.
  • Figure 2: Species occurrence records generally exhibit sampling biases, as depicted here by the training presences in the dataset from Elith et al. (2020). (a) The number of presences per species follows a long-tailed distribution, with many species having only a limited number of available observations. To address this issue, we incorporate a species weight $w_s$ for each species in our loss function. (b) The geographic distribution of the presence records of all species shows varying biases across regions. We introduce the pseudo-absence weight $\lambda_2$ to mitigate this bias. Additional plots for the remaining regions, not presented here, can be found in the data appendix.
  • Figure 3: $k$-fold block cross-validation (Roberts et al., 2017) is used to find the optimal value of the pseudo-absence weight $\lambda_2$. It spatially partitions the presence observations into training and validation sets comprising 80% and 20% of the samples, respectively. The presence records shown here are from the Swiss region of the dataset described in the dataset section.
  • Figure 4: Left: Impact on the AUC when using the species weight $w_s$ in the loss function, grouped by the number of presence records in the training set. The gain from employing $w_s$ is more pronounced for species with fewer presence records. Right: Impact on the AUC when using random ($\lambda_2 = 0$) or target-group ($\lambda_2 = 1$) background points, with every symbol representing a species. While many species benefit from using only target-group background points, not all do.
  • Figure 5: Focus on three distinct (anonymized) species, labeled as can12, can15, and sa29. Each species is represented by its respective training set, test set, and prediction maps. These visualizations illustrate the role and impact of different values for the pseudo-absence weight $\lambda_2$ on the prediction maps generated by the model.
  • ...and 2 more figures
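
The block cross-validation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the square-grid blocking, the function names `block_folds` and `select_lambda2`, and the `score_fn` callback (which would train a model on the training indices with a given $\lambda_2$ and return a validation score) are all hypothetical. Using $k = 5$ gives roughly the 80%/20% split mentioned in the caption when blocks hold similar numbers of presences.

```python
import random
from collections import defaultdict


def block_folds(points, block_size, k=5, seed=0):
    """Assign each (lon, lat) point to one of k folds via its square spatial block."""
    blocks = defaultdict(list)
    for i, (lon, lat) in enumerate(points):
        key = (int(lon // block_size), int(lat // block_size))
        blocks[key].append(i)
    keys = sorted(blocks)
    random.Random(seed).shuffle(keys)      # reproducible random block order
    fold_of = [None] * len(points)
    for j, key in enumerate(keys):
        for i in blocks[key]:
            fold_of[i] = j % k             # every point in a block shares a fold
    return fold_of


def select_lambda2(points, candidates, score_fn, block_size=1.0, k=5):
    """Return the candidate lambda_2 with the best mean validation score."""
    fold_of = block_folds(points, block_size, k)
    best, best_score = None, float("-inf")
    for lam2 in candidates:
        scores = []
        for fold in range(k):
            train = [i for i, f in enumerate(fold_of) if f != fold]
            val = [i for i, f in enumerate(fold_of) if f == fold]
            if train and val:              # skip folds that received no blocks
                scores.append(score_fn(train, val, lam2))
        if scores and sum(scores) / len(scores) > best_score:
            best, best_score = lam2, sum(scores) / len(scores)
    return best
```

Keeping whole blocks together in one fold is what makes the validation spatial: validation presences come from areas the model did not see during training, rather than from neighbours of training points.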