Predicting Species Occurrence Patterns from Partial Observations
Hager Radi Abdelwahed, Mélisande Teng, David Rolnick
TL;DR
The paper tackles the prediction of species encounter rates from satellite imagery when observational data are only partially available. It introduces SatButterfly, a butterfly dataset designed to pair with the existing SatBird dataset, together with a cross-taxon framework for transferring knowledge from data-rich taxa to data-scarce ones. It presents R-Tran, a regression transformer that fuses image features with partial species information via target and state embeddings; the model is trained with masked labels and can be applied in flexible inference scenarios. Empirical results show that R-Tran outperforms baselines on both within-taxon and cross-taxon tasks, highlighting the value of cross-taxon relationships for biodiversity monitoring. The work advances practical ecological modeling by enabling joint predictions across taxa from citizen-science data and remote sensing, with potential extensions to presence-only datasets such as iNaturalist.
Abstract
To address the interlinked biodiversity and climate crises, we need an understanding of where species occur and how these patterns are changing. However, observational data on most species remain very limited, and the amount of data available varies greatly between taxonomic groups. We introduce the problem of predicting species occurrence patterns given (a) satellite imagery, and (b) known information on the occurrence of other species. To evaluate algorithms on this task, we introduce SatButterfly, a dataset of satellite images, environmental data and observational data for butterflies, which is designed to pair with the existing SatBird dataset of bird observational data. To address this task, we propose a general model, R-Tran, for predicting species occurrence patterns that can use partial observational data wherever it is available. We find that R-Tran outperforms other methods in predicting species encounter rates with partial information both within a taxon (birds) and across taxa (birds and butterflies). Our approach opens new perspectives for transferring insights from species with abundant data to species with scarce data, by modelling the ecosystems in which they co-occur.
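
The idea of training with masked labels, mentioned in the TL;DR, can be illustrated with a minimal sketch. This is not the R-Tran implementation: the function names, the `UNKNOWN` sentinel, the `known_fraction` parameter, and the choice to score the loss only on hidden species are all assumptions made for illustration. The sketch hides a random subset of a location's encounter rates, keeps the rest as the partial information a model would receive, and evaluates a prediction on the hidden species.

```python
import random

# Hypothetical sentinel marking a species whose encounter rate is hidden.
# It is negative so it cannot be confused with a true rate of 0.0.
UNKNOWN = -1.0


def mask_labels(rates, known_fraction, rng):
    """Hide a random subset of encounter rates.

    Returns (partial, hidden): `partial` keeps the known rates and puts
    UNKNOWN elsewhere; `hidden` is True for each species that was masked.
    """
    n = len(rates)
    n_known = int(round(known_fraction * n))
    known_idx = set(rng.sample(range(n), n_known))
    partial = [r if i in known_idx else UNKNOWN for i, r in enumerate(rates)]
    hidden = [i not in known_idx for i in range(n)]
    return partial, hidden


def masked_mse(pred, target, hidden):
    """Mean squared error over the hidden species only (an assumed choice)."""
    errs = [(p - t) ** 2 for p, t, h in zip(pred, target, hidden) if h]
    return sum(errs) / len(errs) if errs else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    rates = [0.10, 0.50, 0.00, 0.90]  # made-up encounter rates for one location
    partial, hidden = mask_labels(rates, known_fraction=0.5, rng=rng)
    print(partial, hidden)
```

At inference time the same `partial` vector could be built directly from whatever observations exist at a location, for example known bird encounter rates with every butterfly entry set to `UNKNOWN` in a cross-taxon setting.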
