How Much is Unseen Depends Chiefly on Information About the Seen

Seongmin Lee; Marcel Böhme

TL;DR

The paper addresses the estimation of unseen probability mass in multinomial settings by linking the unseen portion to information about the seen frequencies. It gives a precise decomposition of the expected missing mass $\mathbb{E}[M_k]$ in terms of the seen-frequency statistics $f_k(n)$ and a rapidly decaying remainder, and derives a large class of estimators from different representations of this expectation. It then develops a minimal-bias estimator $\hat{M}_k^B$ whose bias decays exponentially, and a genetic-algorithm-based search that discovers distribution-specific estimators $\hat{M}_k^{\text{Evo}}$ with minimized MSE that outperform the Good-Turing (GT) estimator across multiple distributions. In the experiments, the MSE of the evolved estimators is roughly 80% of the GT estimator's, and the evolved estimators beat GT in over 93% of runs when there are at least as many samples as classes; the methodology is openly shared for replication. Overall, the work advances distribution-free estimation by exploiting information about the seen classes through a structured search over representations, yielding practical improvements in estimating the unseen mass and related functionals.

Abstract

The missing mass refers to the proportion of data points in an unknown population of classifier inputs that belong to classes not present in the classifier's training data, which is assumed to be a random sample from that unknown population. We find that in expectation the missing mass is entirely determined by the numbers $f_k$ of classes that appear in the training data exactly $k$ times, plus an exponentially decaying error. While this is the first precise characterization of the expected missing mass in terms of the sample, the induced estimator suffers from an impractically high variance. However, our theory suggests a large search space of nearly unbiased estimators that can be searched effectively and efficiently. Hence, we cast distribution-free estimation as an optimization problem to find a distribution-specific estimator with a minimized mean-squared error (MSE), given only the sample. In our experiments, our search algorithm discovers estimators that have a substantially smaller MSE than the state-of-the-art Good-Turing estimator. This holds for over 93% of runs when there are at least as many samples as classes. Our estimators' MSE is roughly 80% of the Good-Turing estimator's.
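To make the quantities in the abstract concrete, the following is a minimal sketch of the Good-Turing baseline and of a Monte Carlo estimate of its MSE. It assumes the textbook form of the Good-Turing estimator, $(k+1)\,f_{k+1}/n$, which may differ in detail from the paper's exact definition; the function names (`good_turing_mass`, `true_mass`, `mse_of_good_turing`) are hypothetical and not taken from the authors' code.

```python
import random
from collections import Counter


def good_turing_mass(sample, k=0):
    """Textbook Good-Turing estimate of M_k: (k + 1) * f_{k+1} / n.

    Assumption: this is the standard form, not necessarily the paper's exact one.
    """
    n = len(sample)
    counts = Counter(sample)                 # N_x: occurrences of each class x
    freq_of_freq = Counter(counts.values())  # f_j: number of classes seen j times
    return (k + 1) * freq_of_freq[k + 1] / n


def true_mass(sample, probs, k=0):
    """True M_k: total probability of classes seen exactly k times.

    Classes are assumed to be the integers 0 .. len(probs) - 1.
    """
    counts = Counter(sample)
    return sum(p for x, p in enumerate(probs) if counts[x] == k)


def mse_of_good_turing(probs, n, runs=2000, seed=0):
    """Monte Carlo estimate of the MSE of the Good-Turing estimate of M_0."""
    rng = random.Random(seed)
    classes = range(len(probs))
    squared_error = 0.0
    for _ in range(runs):
        sample = rng.choices(classes, weights=probs, k=n)
        error = good_turing_mass(sample) - true_mass(sample, probs)
        squared_error += error ** 2
    return squared_error / runs
```

For instance, `mse_of_good_turing([1 / 100] * 100, n=100)` simulates a uniform distribution with as many samples as classes, the regime the abstract singles out. A discovered estimator would be evaluated the same way, with its own estimate substituted for `good_turing_mass`.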

Paper Structure (23 sections, 8 theorems, 37 equations, 5 figures, 9 tables, 1 algorithm)

Introduction
Background
Contribution of the Paper
Dependencies Between Frequencies $N_x$
Dependency Among Frequencies
A Large Class of Estimators
Estimator with Exponentially Decaying Bias
Estimation with Minimal MSE as Search Problem
Experiment
Evaluating our Minimal-Bias Estimator
Evaluating our Estimator Discovery Algorithm
Discussion
Comparing the Bias of the Estimators
Bounding the Variance of $\hat{M}_k^{B}$
Constraints for the Coefficients of the $\mathbb{E}\left[M_k\right]$ Representations in the Search Space
...and 8 more sections

Key Result

Theorem 1.1

where $R_{n,k}={n \choose k}(-1)^{n - k} f_{n + 1}(n + 1)$ is the remainder.
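A sketch of how a remainder of this form can arise, under assumptions not stated on this page: write $p_x$ for the probability of class $x$, take $f_k(n)=\sum_x \binom{n}{k}p_x^k(1-p_x)^{n-k}$ to be the expected number of classes seen exactly $k$ times in a sample of size $n$, and set $g_k(n)=f_k(n)/\binom{n}{k}$ (the quantity named in Figure 1). The identity reached in the last step is one representation consistent with the stated remainder, not necessarily the theorem's exact wording.

$$
\begin{aligned}
\mathbb{E}[M_k] &= \sum_x p_x\binom{n}{k}p_x^{k}(1-p_x)^{n-k} = \binom{n}{k}\,g_{k+1}(n+1),\\
g_{j}(n) &= \sum_x p_x^{j}(1-p_x)^{n-j}\bigl(p_x + (1-p_x)\bigr) = g_{j+1}(n+1) + g_{j}(n+1)
\;\Longrightarrow\; g_{j}(n+1) = g_{j}(n) - g_{j+1}(n+1),\\
g_{k+1}(n+1) &= \sum_{i=1}^{n-k}(-1)^{i-1}g_{k+i}(n) + (-1)^{n-k}g_{n+1}(n+1),\\
\mathbb{E}[M_k] &= \binom{n}{k}\sum_{i=1}^{n-k}(-1)^{i-1}g_{k+i}(n) + \underbrace{\binom{n}{k}(-1)^{n-k}f_{n+1}(n+1)}_{R_{n,k}}.
\end{aligned}
$$

The third line applies the recurrence repeatedly for $j=k+1,\dots,n$, and the last line uses $g_{n+1}(n+1)=f_{n+1}(n+1)=\sum_x p_x^{n+1}$, since $\binom{n+1}{n+1}=1$.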

Figures (5)

Figure 1: $g_k(n)$ lower triangle matrix
Figure 2: Absolute bias of $\hat{M}_0^B$ and $\hat{M}_0^G$ (a) as a function of $n$ for $k=0$ and (b) as a function of $k$ for $n=2000$ ($S = 1000$, log-scale).
Figure 3: The MSE of an estimator discovered using a sample ($S,n=100,200$) from one distribution (individual boxes) applied to another target distribution (box clusters).
Figure 4: MSE comparison for the missing mass $M_0$ ($S=100$, $n=100$) on extended samples $X^{cn}$ ($c\in\{2,5,10\}$) between the GT estimator $\hat{M}_0^G$ and the estimator adapted from $\hat{M}_0^{\text{Evo}}$, which was evolved for $X^n$. 'Ratio' is the MSE ratio $\mathit{MSE}(\hat{M}_0^{\text{Evo}})/\mathit{MSE}(\hat{M}_0^G)$, and '$p<.05$' reports the result of the one-sided Wilcoxon signed-rank test.
Figure 4: Absolute bias of $\hat{M}_0^B$ and $\hat{M}_0^G$ (a,b) as a function of $n$ for $k=0$ and (c) as a function of $k$ for $n=2000$ ($S = 1000$, log-scale).

Theorems & Definitions (15)

Theorem 1.1
Theorem 3.1
Theorem B.1
proof
Lemma B.2
proof
Lemma B.3
proof
Theorem B.4
proof
...and 5 more
