Inferring the presence and abundance of rare waterbirds species from scarce data

Barbara Bricout, Laura Dami, Pierre Defos du Rau, Sophie Donnet, Thomas Galewski, Stephane Robin

TL;DR

The paper tackles missing and zero-inflated count data in rare waterbird monitoring by introducing ZI-PLN-PCA, a zero-inflated Poisson-Log-Normal model with a low-rank latent Gaussian layer to capture cross-year dependence across sites. It develops a variational EM inference scheme with an ELBO objective, enabling joint imputation of missing counts, estimation of covariate effects on presence and abundance, and selection of the latent dimension $q$, along with approximate confidence intervals. The framework yields conditional and marginal prediction intervals for imputations and supports temporal trend estimation and change-point detection through year-specific effects. Demonstrations on European and North African waterbird datasets show improved imputation accuracy over non-inflated models, sensible uncertainty quantification, and the ability to detect trends and regime shifts in populations of rare species, with practical implications for conservation monitoring.

Abstract

Abundance data are used in ecology for species monitoring and conservation. These count data often display several specific characteristics, such as numerous missing values, high variance, and a high proportion of zeros, particularly when monitoring rare species. We present a model that aims to impute missing data and estimate the effect of covariates on species presence and abundance. It is based on the Poisson log-normal model, which offers more flexibility in the variance of counts than a Poisson model. A latent variable is added to account for the overrepresentation of zeros in the data. The imputation of missing data is made possible by assuming that the latent variance matrix has low rank and by including covariates. We demonstrate identifiability in the presence of missing data. Since maximum likelihood inference is intractable, we use a variational expectation-maximization algorithm to infer the parameters. We provide an estimate of the asymptotic variance of the estimators and derive prediction intervals for the imputations, an estimate of the temporal trend, and a procedure for detecting a potential change in this trend. We evaluate our imputations and associated prediction intervals using an artificially degraded monitoring data set. We conclude with an illustration on a waterbird monitoring data set.
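To make the model structure described above concrete, here is a minimal simulation sketch of a zero-inflated Poisson log-normal model with a low-rank latent layer. The parametrization is an assumption made for illustration (log link on abundance with coefficients `beta`, logit link on presence with coefficients `gamma`, latent covariance $\Sigma = C C^\top$ of rank $q$, rows as sites and columns as years); the function name `simulate_zi_pln_pca` and all numerical values are hypothetical and are not taken from the paper.

```python
import numpy as np

def simulate_zi_pln_pca(X, beta, gamma, C, rng):
    """Simulate counts from an assumed ZI-PLN-PCA-style generative model.

    X     : (n, d) covariates (one row per site)
    beta  : (d, p) abundance coefficients (assumed log link)
    gamma : (d, p) presence coefficients (assumed logit link)
    C     : (p, q) loading matrix, so that Sigma = C @ C.T has rank <= q
    """
    n, p = X.shape[0], beta.shape[1]
    q = C.shape[1]
    W = rng.standard_normal((n, q))        # q independent latent factors per site
    Z = W @ C.T                            # latent Gaussian layer, covariance C C^T
    abundance = rng.poisson(np.exp(X @ beta + Z))
    presence_prob = 1.0 / (1.0 + np.exp(-(X @ gamma)))
    present = rng.random((n, p)) < presence_prob
    Y = np.where(present, abundance, 0)    # absence forces a structural zero
    return Y, present

rng = np.random.default_rng(0)
n, p, d, q = 200, 9, 2, 2                  # illustrative sizes only
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta = rng.normal(0.5, 0.3, size=(d, p))
gamma = rng.normal(0.0, 0.5, size=(d, p))
C = rng.normal(0.0, 0.5, size=(p, q))
Sigma = C @ C.T

Y, present = simulate_zi_pln_pca(X, beta, gamma, C, rng)

# Artificially degrade the data by hiding 30% of entries completely at random.
missing = rng.random((n, p)) < 0.3
Y_obs = np.where(missing, np.nan, Y.astype(float))
```

Zeros in `Y` arise either from absence (`present` is false) or from a Poisson draw of zero, which is the source of zero inflation. Because `Z` depends on only `q` factors per site, the counts observed at a site in some years carry information about its missing years, which is the intuition behind using a low-rank latent layer for imputation.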

Paper Structure (55 sections, 3 theorems, 42 equations, 9 figures, 2 tables, 2 algorithms)

Key Result

Proposition 1

Under the ZI-PLN-PCA model, we have where $\Sigma = (\sigma_{jk})_{(j,k)\in [p]^2}$ and $\sigma_j^2 = \sigma_{jj}$.

Figures (9)

  • Figure 1: Illustrations of $\mathcal{Q}$ defined in Proposition 3 for $p=9$ and $q=2$.
  • Figure 2: Predictions under the MAR time--site missingness scenario for four missing rates (5%, 30%, 50%, 70%). Top left: ratio $|Y_{ij}-\widetilde{Y}_{ij}(X^{G}, \text{ZI-PLN-PCA})|\,/\,|Y_{ij}-\widetilde{Y}_{ij}(X^{F}, \text{ZI-PLN-PCA})|$. Top right: ratio $|Y_{ij}-\widetilde{Y}_{ij}(X^{F}, \text{ZI-PLN-PCA})|\,/\,|Y_{ij}-\widetilde{Y}_{ij}(X^{F}, \text{PLN-PCA})|$. Bottom left: width of conditional prediction intervals from the ZI-PLN-PCA model for the MAR site and years scenario of missing data. Bottom right: width of conditional prediction intervals from the ZI-PLN-PCA model for the MCAR scenario of missing data.
  • Figure 3: Overview of data structure and abundance distribution for species B across North Africa (1990--2023). (a) Spatio-temporal distribution of presence records and missing data across wetland sites. (b) Distribution of observed abundances, with a log$_{10}$ scale on both axes.
  • Figure 4: Map of the estimated population of species B in 2023 across 419 North African sites. Blue points indicate observed counts, whereas orange points correspond to conditionally imputed values.
  • Figure 5: Temporal trends in abundance and probability of presence of species B across North Africa (1990--2023), as estimated from the fitted model.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Proposition 1
  • Proposition 2: Identifiability of $(\beta, \gamma, \Sigma)$
  • proof
  • Proposition 3: Identifiability of $(\beta, \gamma, \Sigma)$ when $\Sigma$ has rank $q < p$
  • proof
  • proof