
Estimating Unknown Population Sizes Using the Hypergeometric Distribution

Liam Hodgson, Danilo Bzdok

TL;DR

This work addresses the estimation of population sizes when both the total size $N$ and the category counts $N_i$ are unknown and the population is severely under-sampled. It introduces a hypergeometric-likelihood framework with a continuous, differentiable relaxation, and uses a variational autoencoder to model mixtures of ground-truth distributions, enabling joint inference of counts and latent structure. On synthetic benchmarks the approach outperforms multinomial and Poisson baselines, and it yields informative latent representations for downstream tasks, including reading passage complexity and single-cell transcriptomics. The method applies to finite discrete populations in domains such as NLP and biology, where observed counts are drawn from an underlying universe of elements whose size is unknown.

Abstract

The multivariate hypergeometric distribution describes sampling without replacement from a discrete population of elements divided into multiple categories. Addressing a gap in the literature, we tackle the challenge of estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown. Here, we propose a novel solution using the hypergeometric likelihood to solve this estimation challenge, even in the presence of severe under-sampling. We develop our approach to account for a data generating process where the ground-truth is a mixture of distributions conditional on a continuous latent variable, such as with collaborative filtering, using the variational autoencoder framework. Empirical data simulation demonstrates that our method outperforms other likelihood functions used to model count data, both in terms of accuracy of population size estimate and in its ability to learn an informative latent space. We demonstrate our method's versatility through applications in NLP, by inferring and estimating the complexity of latent vocabularies in text excerpts, and in biology, by accurately recovering the true number of gene transcripts from sparse single-cell genomics data.
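
As a minimal sketch of the likelihood described above, the multivariate hypergeometric log-probability of observing counts $c_i$ from categories of size $N_i$ is $\sum_i \log\binom{N_i}{c_i} - \log\binom{N}{n}$, with $N=\sum_i N_i$ and $n=\sum_i c_i$. One common way to make this differentiable in the $N_i$, assumed here and not necessarily the exact relaxation used in the paper, is to write the binomial coefficients with the log-gamma function so that non-integer sizes are allowed. The function names are hypothetical.

```python
import math


def log_binom(a: float, b: float) -> float:
    """log C(a, b) via log-gamma; valid for real a >= b >= 0."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)


def mvhg_log_likelihood(counts, sizes) -> float:
    """Log-probability of drawing `counts` without replacement
    from categories with (possibly non-integer) `sizes`."""
    n = sum(counts)  # sample size
    N = sum(sizes)   # total population size
    per_category = sum(log_binom(Ni, ci) for Ni, ci in zip(sizes, counts))
    return per_category - log_binom(N, n)


# Sanity check against the exact integer formula for K=2, N=100.
exact = math.comb(40, 3) * math.comb(60, 7) / math.comb(100, 10)
relaxed = math.exp(mvhg_log_likelihood([3, 7], [40, 60]))
print(exact, relaxed)

# The relaxed form also accepts non-integer category sizes.
between = mvhg_log_likelihood([3, 7], [40.5, 59.5])
print(between)
```

At integer sizes the relaxed form agrees with the exact probability up to floating-point error, while at non-integer sizes it interpolates smoothly, which is what allows gradient-based optimization over the $N_i$.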

Paper Structure

This paper contains 21 sections, 7 equations, 13 figures, and 3 tables.

Figures (13)

  • Figure 1: Example of negative log-likelihood landscape for $K=2$ and $N=100$ ($10^4$ trials).
  • Figure 2: Maximum likelihood estimate Manhattan error for different numbers of trials at different maximum sample fractions. 50% confidence interval over 50 random seeds. $K=2$, $N=100$ ($N_1=40$, $N_2=60$).
  • Figure 3: Maximum likelihood estimate Manhattan error per training epoch obtained with gradient descent, for different numbers of trials. The accuracy increases and the variance of the estimate decreases with increasing number of samples. The increase in error following an initial decrease occurs because we measure absolute error, and this behavior corresponds to the estimate overshooting the true value. 50% confidence interval over 20 random seeds. $K=2$, $f_{max}=0.4$, $N=100$ ($N_1=30$, $N_2=70$).
  • Figure 4: Comparison of association strength between reading passage complexity metrics and total/unique tokens in the original/latent bag-of-words. Readability indices are BT easiness (BT), Flesch-Reading-Ease (FRE), Flesch-Kincaid-Grade-Level (FKGL), Automated Readability Index (ARI), SMOG Readability (SMOG), New Dale-Chall Readability Formula (DCR), CAREC, CAREC-M, and CML2RI.
  • Figure 5: Measured counts for synthetic spike-in RNA #12 with and without human cells present. Red dashed line is ground-truth count (ideally all measurements would be equal).
  • ...and 8 more figures
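
The setting behind Figures 1 and 2 can be sketched as a small simulation: draw repeated samples without replacement from a population with $K=2$, $N_1=40$, $N_2=60$, then search for the pair $(N_1, N_2)$ that minimizes the summed negative log-likelihood and report the Manhattan error $|\hat N_1 - N_1| + |\hat N_2 - N_2|$. This is an illustration under stated assumptions, not the authors' experimental code: the per-trial sample size is assumed uniform between 1 and $f_{max} N$, the search is an exhaustive integer grid (not the gradient descent of Figure 3), and all function names and the grid bound are hypothetical.

```python
import math
import random


def log_binom(a, b):
    """log C(a, b) via log-gamma."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)


def simulate_trials(sizes, n_trials, f_max, rng):
    """Sample without replacement; sample size ~ Uniform{1..f_max*N} (assumption)."""
    population = [k for k, size in enumerate(sizes) for _ in range(size)]
    max_n = max(1, int(f_max * len(population)))
    trials = []
    for _ in range(n_trials):
        drawn = rng.sample(population, rng.randint(1, max_n))
        trials.append([drawn.count(k) for k in range(len(sizes))])
    return trials


def total_nll(trials, sizes):
    """Summed negative log-likelihood of all trials for candidate sizes."""
    N = sum(sizes)
    nll = 0.0
    for counts in trials:
        ll = sum(log_binom(Ni, ci) for Ni, ci in zip(sizes, counts))
        nll -= ll - log_binom(N, sum(counts))
    return nll


def grid_mle(trials, upper):
    """Exhaustive integer search over (N1, N2) for K=2, each up to `upper`."""
    lo1 = max(c[0] for c in trials)  # N1 cannot be below any observed count
    lo2 = max(c[1] for c in trials)
    # Precompute the per-category and total-population terms once per value.
    a1 = {v: sum(log_binom(v, c[0]) for c in trials) for v in range(lo1, upper + 1)}
    a2 = {v: sum(log_binom(v, c[1]) for c in trials) for v in range(lo2, upper + 1)}
    b = {v: sum(log_binom(v, c[0] + c[1]) for c in trials)
         for v in range(lo1 + lo2, 2 * upper + 1)}
    nll, n1, n2 = min((b[n1 + n2] - a1[n1] - a2[n2], n1, n2)
                      for n1 in a1 for n2 in a2)
    return (n1, n2), nll


true_sizes = [40, 60]
rng = random.Random(0)
trials = simulate_trials(true_sizes, n_trials=200, f_max=0.4, rng=rng)
estimate, nll_hat = grid_mle(trials, upper=200)
error = sum(abs(e - t) for e, t in zip(estimate, true_sizes))
print("estimate:", estimate, "Manhattan error:", error)
```

Rerunning with more trials or a larger `f_max` should, in line with the figure captions, tend to reduce the error, although any single seed can deviate from that trend.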