Composite likelihood inference for the Poisson log-normal model

Julien Stoehr, Stephane S. Robin

TL;DR

This paper tackles parameter inference for the Poisson log-normal model in multivariate count data by marrying EM with composite likelihood and importance sampling (ISEM). By using block-wise composite likelihood and a mixture Gaussian proposal, the method achieves maximum-likelihood-like asymptotics and valid uncertainty quantification for moderately high-dimensional problems, while mitigating the computational bottleneck of high-dimensional latent integration. The authors derive CL-EM updates, establish variance estimation via Godambe information, and provide robust block designs to estimate in-block latent covariances. Empirical results on simulated data and the Barents Sea fish dataset show that CL-ISEM yields reliable inference and competitive, scalable performance compared to variational approaches, with the added advantage of principled standard errors and hypothesis tests.

Abstract

The Poisson log-normal model is a latent variable model that provides a generic framework for the analysis of multivariate count data. Inferring its parameters can be a daunting task since the conditional distribution of the latent variables given the observed ones is intractable. For this model, variational approaches are the gold standard solution as they prove to be computationally efficient but lack theoretical guarantees on the estimates. Sampling-based solutions are quite the opposite. We first define a Monte Carlo EM algorithm that can achieve maximum likelihood estimators, but that is computationally efficient only for low-dimensional latent spaces. We then propose a novel inference procedure combining the EM framework with composite likelihood and importance sampling estimates. The algorithm preserves the desirable asymptotic properties of maximum likelihood estimators while circumventing the high-dimensional integration bottleneck, thus maintaining computational feasibility for moderately large datasets. This approach enables grounded parameter estimation, confidence intervals, and hypothesis testing. Application to the Barents Sea fish dataset demonstrates the algorithm's capacity to identify significant environmental effects and residual interspecies correlations.
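
To make the importance sampling ingredient concrete, here is a minimal sketch for a univariate Poisson log-normal model, assuming $Z \sim \mathcal{N}(\mu, \sigma^2)$ and $Y \mid Z \sim \mathcal{P}(e^{Z})$. It estimates $\mathbb{E}[Z \mid Y = y]$ by self-normalized importance sampling and reports the effective sample size (ESS). The function names, the single Gaussian proposal, and the parameter values are illustrative assumptions; the paper itself uses a mixture Gaussian proposal inside an EM loop, which this sketch does not reproduce.

```python
import math
import random


def log_normal_pdf(z, mean, var):
    """Log-density of a univariate Gaussian N(mean, var) at z."""
    return -0.5 * (math.log(2 * math.pi * var) + (z - mean) ** 2 / var)


def log_poisson_pmf(y, log_rate):
    """Log-probability of a Poisson count y with rate exp(log_rate)."""
    return y * log_rate - math.exp(log_rate) - math.lgamma(y + 1)


def snis_posterior_mean(y, mu, sigma2, prop_mean, prop_var, n_draws=2000, seed=0):
    """Self-normalized importance sampling estimate of E[Z | Y = y].

    Assumed model (illustrative): Z ~ N(mu, sigma2), Y | Z ~ Poisson(exp(Z)).
    Proposal (an assumption of this sketch): a single Gaussian N(prop_mean, prop_var).
    Returns the estimated posterior mean and the effective sample size.
    """
    rng = random.Random(seed)
    draws = [rng.gauss(prop_mean, math.sqrt(prop_var)) for _ in range(n_draws)]
    # Unnormalized log-weights: log p(z) + log p(y | z) - log q(z).
    log_w = [
        log_normal_pdf(z, mu, sigma2)
        + log_poisson_pmf(y, z)
        - log_normal_pdf(z, prop_mean, prop_var)
        for z in draws
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_w)
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    post_mean = sum(wi * zi for wi, zi in zip(w, draws)) / total
    ess = total ** 2 / sum(wi ** 2 for wi in w)
    return post_mean, ess


post_mean, ess = snis_posterior_mean(y=5, mu=0.0, sigma2=1.0, prop_mean=1.0, prop_var=1.0)
print(f"posterior mean estimate: {post_mean:.3f}, ESS: {ess:.0f} of 2000 draws")
```

An ESS close to the number of draws suggests the proposal matches the target well; a low ESS signals that a few draws dominate the estimate, which is the difficulty that grows with the dimension of the latent space.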

Paper Structure (32 sections, 4 theorems, 67 equations, 13 figures, 5 tables)


Key Result

Proposition 1

If $p_{\boldsymbol{\theta}}(\mathbf{Z} \mid \mathbf{Y}^{(b)}) = p_{\boldsymbol{\theta}}(\mathbf{Z}^{(b)} \mid \mathbf{Y}^{(b)})$, the CL-EM algorithm yields a sequence $(\boldsymbol{\theta}^{(h)})_{h\in\mathbb{N}}$ such that $c\ell_{\boldsymbol{\theta}^{(h+1)}}(\mathbf{Y}) \ge \dots$

Figures (13)

  • Figure 1: Number of blocks $C$ as a function of the number of species $p$ (log-log scale) for blocks of size $k=2$ (black squares $\blacksquare$), $k=3$ (blue circles $\medcircle$), $k=5$ (red triangles up $\triangle$) and $k=7$ (green triangles down $\triangledown$). Solid line: number of blocks actually used; dashed line: upper bound ${{p}\choose{k}}$; dotted line: lower bound $p(p-1)/[k(k-1)]$.
  • Figure 2: Distribution of the $p$-values from the Kolmogorov–Smirnov test applied to the distribution of the standardized estimates ${\widetilde{\beta}}_{\ell j}$ over the $M = 100$ simulations, for each inference method: full likelihood (FL), composite likelihood with blocks of size $k$ (CL$k$), variational EM (VEM), and jackknife-based variational EM (JK). Each boxplot summarizes the $d \times p = 3p$ normalized coefficients ${\widetilde{\beta}}_{\ell j}$. Dotted red lines: $\alpha = 5\%$ significance threshold after Bonferroni correction (i.e., $\alpha/(dp)$).
  • Figure 3: Effective sample size (ESS) over the first 50 iterations for the different inference methods on the reduced Barents Sea dataset ($p = 7$ species). From top to bottom: full likelihood (FL) and composite likelihood (CL$k$) with block sizes $k = 2, 3$, and $5$. One boxplot is shown per iteration, summarizing the ESS values across all blocks and all sites. Horizontal red dashed line: median ESS across the remaining iterations.
  • Figure 4: Comparison of the estimates obtained using different inference methods on the Barents Sea reduced dataset ($p = 7$ species). Left: estimated regression coefficients $\widehat{{\boldsymbol{B}}}=(\widehat{\beta}_{hj})$; center: estimated covariance parameters $\widehat{{\boldsymbol{\Sigma}}} = (\widehat{\sigma}_{jk})$; right: estimated variances of the regression coefficients $\widehat{\beta}_{hj}$. $x$-axis: estimates from full likelihood inference (FL), $y$-axis: estimates from the other methods: FL (gray asterisk [$*$], used as the reference), composite likelihood (CL$k$) with blocks of size $k = 2$ (cyan diamond [$\meddiamond$]), $k = 3$ (green plus sign [$+$]), and $k = 5$ (red triangle [$\triangle$]), and variational EM (VEM, black circles [$\medcircle$]).
  • Figure 5: Results on the Barents Sea full dataset ($p = 30$ species). Left: estimated regression coefficients $\widehat{{\boldsymbol{B}}} = (\widehat{\beta}_{\ell j})$; center: estimated covariance parameters $\widehat{{\boldsymbol{\Sigma}}} = (\widehat{\sigma}_{jk})$. $x$-axis: estimates obtained using the composite likelihood method CL$k$ with blocks of size $k = 5$ (CL5, red triangle [$\triangle$] = reference); $y$-axis: estimates obtained using variational EM (VEM, black circles [$\medcircle$]), CL3 (green plus sign [$+$]), and CL7 (blue times sign [$\times$]). Right: boxplot of the effective sample size across all sites and blocks as a function of the iteration number for the CL5 algorithm.
  • ...and 8 more figures
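
The two bounds quoted in the Figure 1 caption are easy to evaluate numerically. The sketch below computes them for given $p$ and $k$. It assumes the lower bound arises from requiring every pair of species to share at least one block (each block of size $k$ covers $k(k-1)/2$ of the $p(p-1)/2$ pairs), and it rounds that bound up to an integer; both the function name and the rounding are choices made here, not taken from the paper.

```python
from math import comb


def block_count_bounds(p: int, k: int) -> tuple[int, int]:
    """Return (lower, upper) bounds on the number of blocks of size k among p species.

    lower: p(p-1) / [k(k-1)], rounded up to an integer (rounding is an assumption here).
    upper: binomial coefficient C(p, k), i.e. all possible blocks of size k.
    """
    if not 2 <= k <= p:
        raise ValueError("need 2 <= k <= p")
    numerator = p * (p - 1)
    denominator = k * (k - 1)
    lower = -(-numerator // denominator)  # integer ceiling division
    upper = comb(p, k)
    return lower, upper


for p, k in [(7, 2), (7, 3), (30, 5)]:
    lower, upper = block_count_bounds(p, k)
    print(f"p={p}, k={k}: lower={lower}, upper={upper}")
```

For $k = 2$ the two bounds coincide, since every pair is itself a block. For $p = 30$ and $k = 5$ the gap is wide (44 versus 142506), which illustrates why the choice of block design matters for computational cost.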

Theorems & Definitions (4)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4