Heterogeneous graph neural networks for species distribution modeling

Lauren Harrell, Christine Kaeser-Chen, Burcu Karagol Ayan, Keith Anderson, Michelangelo Conserva, Elise Kleeman, Maxim Neumann, Matt Overlan, Melissa Chapman, Drew Purves

TL;DR

This work tackles species distribution modeling with presence-only data by introducing a heterogeneous graph neural network that treats locations and species as bipartite node sets connected by detection edges. The model learns embeddings through message passing and uses a link-prediction objective to infer species–location occurrences, evaluated on the six-region NCEAS benchmarks. Results show the GNN approach often matches or surpasses traditional single-species SDMs and a baseline MLP, highlighting the benefits of multi-species learning and relational information. The study demonstrates the potential of flexible graph-based representations to integrate species traits, environmental covariates, and detection processes, with future work aimed at richer data fusion and additional edge types for improved ecological modeling.

Abstract

Species distribution models (SDMs) are necessary for measuring and predicting occurrences and habitat suitability of species and their relationship with environmental factors. We introduce a novel presence-only SDM built on graph neural networks (GNNs). In our model, species and locations are treated as two distinct node sets, and the learning task is to predict detection records as the edges that connect locations to species. Using a GNN for SDM allows us to model fine-grained interactions between species and the environment. We evaluate the potential of this methodology on the six-region dataset compiled by the National Center for Ecological Analysis and Synthesis (NCEAS) for benchmarking SDMs. For each of the regions, the heterogeneous GNN model is comparable to or outperforms previously benchmarked single-species SDMs as well as a feed-forward neural network baseline model.
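The framing in the abstract, two node sets joined by detection edges with link prediction as the task, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the record names, the hand-set two-dimensional embeddings, and the sigmoid-of-dot-product decoder are hypothetical and are not taken from the paper's implementation.

```python
import math

# Hypothetical detection records: (location_id, species_id) pairs.
detections = [("loc_a", "sp_1"), ("loc_a", "sp_2"), ("loc_b", "sp_2")]

# Two distinct node sets, each with its own integer index space.
locations = sorted({loc for loc, _ in detections})
species = sorted({sp for _, sp in detections})
loc_index = {name: i for i, name in enumerate(locations)}
sp_index = {name: i for i, name in enumerate(species)}

# Detection edges pointing from location nodes to species nodes.
edges_l2s = [(loc_index[loc], sp_index[sp]) for loc, sp in detections]

# Placeholder embeddings with made-up values; in a trained model these
# would be produced by message passing rather than set by hand.
loc_emb = [[1.0, 0.0], [0.0, 1.0]]
sp_emb = [[1.0, 0.0], [0.5, 0.5]]


def link_score(loc_vec, sp_vec):
    """Assumed decoder: sigmoid of the dot product of two embeddings."""
    dot = sum(a * b for a, b in zip(loc_vec, sp_vec))
    return 1.0 / (1.0 + math.exp(-dot))


# Score every candidate location-species edge, observed or not.
scores = {
    (loc, sp): link_score(loc_emb[loc_index[loc]], sp_emb[sp_index[sp]])
    for loc in locations
    for sp in species
}
```

In this toy setup the unobserved pair `("loc_b", "sp_1")` has orthogonal embeddings, so its score sits at the uninformative midpoint, while observed pairs score higher. A dot-product decoder is only one common choice for link prediction; the source does not state which decoder the authors use.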

Paper Structure

This paper contains 20 sections, 6 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Example of a bipartite heterogeneous graph structure in which species nodes are connected to location nodes through detection edges. The graph on the left shows one-way message passing from locations $v^L_i\in\mathcal{V}^L$ to species $v^S_j\in\mathcal{V}^S$ through detections $\mathcal{E}^{L2S}_{\text{Detection}}$. The middle graph adds the reversed edge set $\mathcal{E}^{S2L}_{\text{Detection}}$, which sends information from species nodes back to location nodes. The graph on the right adds message passing through (pseudo-)negative edges as a distinct edge set.
  • Figure 2: $\text{AUC}_\text{ROC}$ averaged across species per site by region and model methodology. The values of prior results were taken from the top-scoring models in Valavi et al. (2022). Top result per region highlighted in blue.
  • Figure 3: NCEAS Data. (A) The locations of each of the six regions around the globe shaded by number of taxonomic groups. (B) Example presence and presence/absence observations for one species in each of the SWT and NZ regions. (C) Counts of species per taxonomic group in each region.
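The edge sets described in the Figure 1 caption can be sketched as plain edge lists with one aggregation step in each direction. This is a simplified illustration, not the paper's architecture: mean aggregation without learned weights is an assumption, the helper `mean_aggregate` is hypothetical, and the feature values are made up.

```python
def mean_aggregate(edges, source_feats, num_targets):
    """Average the source-node features arriving at each target node.

    edges: list of (source_index, target_index) pairs.
    Targets with no incoming edge keep an all-zero vector.
    """
    dim = len(source_feats[0])
    sums = [[0.0] * dim for _ in range(num_targets)]
    counts = [0] * num_targets
    for src, dst in edges:
        counts[dst] += 1
        for k in range(dim):
            sums[dst][k] += source_feats[src][k]
    return [
        [value / count for value in row] if count else row
        for row, count in zip(sums, counts)
    ]


# Made-up location features (e.g. two environmental covariates each).
loc_feats = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]

# Detection edges from locations to species (the L2S edge set).
edges_l2s = [(0, 0), (1, 0), (1, 1)]
# Reversed edge set (S2L) carrying information back to locations.
edges_s2l = [(sp, loc) for loc, sp in edges_l2s]

# Left panel: one-way passing from locations to species.
species_state = mean_aggregate(edges_l2s, loc_feats, num_targets=2)
# Middle panel: species states flow back along the reversed edges.
location_state = mean_aggregate(edges_s2l, species_state, num_targets=3)
```

Location 2 has no detections, so it receives no message and stays at zero here. The (pseudo-)negative edges in the right panel would be a third edge list passed through the same kind of aggregation, kept separate so the model can treat it as a distinct edge type.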