Bipartite Graph Variational Auto-Encoder with Fair Latent Representation to Account for Sampling Bias in Ecological Networks

Emre Anakok, Pierre Barbillon, Colin Fontaine, Elisa Thebault

TL;DR

This paper tackles sampling bias in citizen science ecological networks by learning embeddings through a bipartite variational graph auto-encoder (BVGAE) augmented with a Hilbert-Schmidt Independence Criterion (HSIC) penalty to produce a fair latent space that is independent of continuous sampling covariates. The model extends VGAE to bipartite graphs with dual GCN encoders and a generalized random dot product decoder, and uses random Fourier features to scale HSIC estimation during training. Validation includes a simulated framework mimicking Spipoll-like sampling and application to the Spipoll dataset, showing that enforcing independence from observer experience can modify ecological inferences while maintaining or improving plant–pollinator network reconstruction at the appropriate level. The approach offers a general, scalable method for debiasing network embeddings in ecological studies and can be adapted to other domains where continuous protected variables influence sampling or observation processes.

Abstract

Citizen science monitoring programs can generate large amounts of valuable data, but are often affected by sampling bias. We focus on a citizen science initiative that records plant-pollinator interactions, with the goal of learning embeddings that summarize the observed interactions while accounting for such bias. In our approach, plant and pollinator species are embedded based on their probability of interaction. These embeddings are derived using an adaptation of variational graph autoencoders for bipartite graphs. To mitigate the influence of sampling bias, we incorporate the Hilbert-Schmidt Independence Criterion (HSIC) to ensure independence from continuous variables related to the sampling process. This allows us to integrate a fairness perspective, commonly explored in the social sciences, into the analysis of ecological data. We validate our method through a simulation study replicating key aspects of the sampling process and demonstrate its applicability and effectiveness using the Spipoll dataset.
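
As an illustration of the independence penalty mentioned above, the following is a minimal sketch of a biased HSIC estimate computed with random Fourier features. It is not the authors' implementation: the function names `random_fourier_features` and `hsic_rff`, the choice of Gaussian kernels, the bandwidths, and the number of features are all assumptions made for this example. The estimate is the squared Frobenius norm of the empirical cross-covariance between the two centered feature maps, which costs time linear in the number of samples instead of quadratic.

```python
import numpy as np


def random_fourier_features(x, n_features, bandwidth, rng):
    """Approximate a Gaussian-kernel feature map (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D covariate as a single column
    # Frequencies drawn from the spectral density of the Gaussian kernel.
    w = rng.normal(scale=1.0 / bandwidth, size=(x.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)


def hsic_rff(z, s, n_features=100, bandwidth_z=1.0, bandwidth_s=1.0, seed=0):
    """Biased HSIC estimate between embeddings z and covariate s (assumed setup)."""
    rng = np.random.default_rng(seed)
    phi = random_fourier_features(z, n_features, bandwidth_z, rng)
    psi = random_fourier_features(s, n_features, bandwidth_s, rng)
    # Center each feature column, then form the empirical cross-covariance.
    phi -= phi.mean(axis=0)
    psi -= psi.mean(axis=0)
    cross_cov = phi.T @ psi / phi.shape[0]
    return float(np.sum(cross_cov ** 2))


# Synthetic check: one embedding depends on the covariate, the other does not.
rng = np.random.default_rng(42)
s = rng.normal(size=2000)
z_dependent = np.column_stack([s + 0.1 * rng.normal(size=2000),
                               rng.normal(size=2000)])
z_independent = rng.normal(size=(2000, 2))

hsic_dep = hsic_rff(z_dependent, s)
hsic_ind = hsic_rff(z_independent, s)
print(f"dependent: {hsic_dep:.4f}, independent: {hsic_ind:.4f}")
```

In a training loop, a term of this kind would be added to the auto-encoder loss with a weight controlling the trade-off between reconstruction and independence; that weight, like the settings above, would need tuning for the data at hand.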

Paper Structure

This paper contains 46 sections, 34 equations, 17 figures, and 6 tables.

Figures (17)

  • Figure 1: Average number of observations per insect order as a function of observer experience level in the Spipoll data set.
  • Figure 2: Summary of the model trained on the Spipoll data set.
  • Figure 3: Numerical replication of the sampling process. We simulate various levels of experience, which determine how many insects the user will observe during a session (top left). During the session, the user can only observe interactions given by one randomly selected row of $B'_0$ (bottom left), with probability depending on the difficulty. The observed insects are reported in the session-pollinator network (top right), which gives us $B$. Aggregating this network by plants yields the observed network $B'$ (bottom right).
  • Figure 4: Estimated probabilities $\widehat{B'}$ of connection between plants and insects on the Spipoll data set obtained with the BVGAE (top) and the fair-BVGAE (bottom). Each row represents a plant genus and each column an insect genus; genera are grouped by taxonomic order.
  • Figure 5: Focus on the embeddings provided by the fair-BVGAE for the observation sessions performed on the genera Daucus, Leucanthemum and Lavandula.
  • ...and 12 more figures