Bipartite Graph Variational Auto-Encoder with Fair Latent Representation to Account for Sampling Bias in Ecological Networks
Emre Anakok, Pierre Barbillon, Colin Fontaine, Elisa Thebault
TL;DR
This paper tackles sampling bias in citizen science ecological networks by learning embeddings with a bipartite variational graph auto-encoder (BVGAE) augmented with a Hilbert-Schmidt Independence Criterion (HSIC) penalty, producing a fair latent space that is independent of continuous sampling covariates. The model extends VGAE to bipartite graphs with dual GCN encoders and a generalized random dot product decoder, and uses random Fourier features to keep HSIC estimation scalable during training. Validation includes a simulation framework mimicking Spipoll-like sampling and an application to the Spipoll dataset, showing that enforcing independence from observer experience can modify ecological inferences while maintaining or improving plant-pollinator network reconstruction at the appropriate level. The approach offers a general, scalable method for debiasing network embeddings in ecological studies and can be adapted to other domains where continuous protected variables influence sampling or observation processes.
Abstract
Citizen science monitoring programs can generate large amounts of valuable data, but are often affected by sampling bias. We focus on a citizen science initiative that records plant-pollinator interactions, with the goal of learning embeddings that summarize the observed interactions while accounting for such bias. In our approach, plant and pollinator species are embedded based on their probability of interaction. These embeddings are derived using an adaptation of variational graph autoencoders for bipartite graphs. To mitigate the influence of sampling bias, we incorporate the Hilbert-Schmidt Independence Criterion (HSIC) to ensure independence from continuous variables related to the sampling process. This allows us to integrate a fairness perspective, commonly explored in the social sciences, into the analysis of ecological data. We validate our method through a simulation study replicating key aspects of the sampling process and demonstrate its applicability and effectiveness using the Spipoll dataset.
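To make the independence penalty concrete, the following is a minimal sketch of a biased HSIC estimate computed with random Fourier features, assuming Gaussian kernels on both the embeddings and the sampling covariate. The function names (`rff`, `hsic_rff`), the bandwidths, the number of features, and the toy data are all illustrative assumptions and are not taken from the paper's implementation.

```python
import numpy as np

def rff(x, n_features, sigma, rng):
    """Random Fourier features approximating a Gaussian kernel of bandwidth sigma."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D covariate as a single column
    w = rng.normal(scale=1.0 / sigma, size=(x.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)

def hsic_rff(z, s, n_features=200, sigma_z=1.0, sigma_s=1.0, seed=0):
    """Biased HSIC estimate between embeddings z and covariate s.

    With centered feature maps, (1/n^2) tr(K H L H) equals the squared
    Frobenius norm of the empirical cross-covariance of the features.
    """
    rng = np.random.default_rng(seed)
    phi = rff(z, n_features, sigma_z, rng)
    psi = rff(s, n_features, sigma_s, rng)
    phi -= phi.mean(axis=0)  # centering plays the role of the matrix H
    psi -= psi.mean(axis=0)
    n = phi.shape[0]
    cross_cov = phi.T @ psi / n
    return float(np.sum(cross_cov ** 2))

# Toy check (hypothetical data): embeddings that leak the covariate
# should score higher than embeddings drawn independently of it.
rng = np.random.default_rng(42)
s = rng.normal(size=500)  # stand-in for a continuous sampling covariate
z_dep = np.column_stack([s + 0.1 * rng.normal(size=500), rng.normal(size=500)])
z_ind = rng.normal(size=(500, 2))
hsic_dep = hsic_rff(z_dep, s)
hsic_ind = hsic_rff(z_ind, s)
print(hsic_dep > hsic_ind)
```

In a training loop, a quantity of this kind would be added to the auto-encoder loss with a weighting coefficient, so that the optimizer trades reconstruction quality against dependence on the covariate. The cost is linear in the number of nodes, rather than quadratic as with full kernel matrices.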
