Estimating the size of a set using cascading exclusion

Sourav Chatterjee, Persi Diaconis, Susan Holmes

Abstract

Let $S$ be a finite set, and $X_1,\ldots,X_n$ an i.i.d. uniform sample from $S$. To estimate the size $|S|$, without further structure, one can wait for repeats and use the birthday problem. This requires a sample size of the order $|S|^{1/2}$. On the other hand, if $S=\{1,2,\ldots,|S|\}$, the maximum of the sample blown up by $n/(n-1)$ gives an efficient estimator based on any growing sample size. This paper gives refinements that interpolate between these extremes. A general non-asymptotic theory is developed. This includes estimating the volume of a compact convex set, the unseen species problem, and a host of testing problems that follow from the question `Is this new observation a typical pick from a large prespecified population?' We also treat regression style predictors. A general theorem gives non-parametric finite $n$ error bounds in all cases.
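The two extremes mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names are hypothetical, the maximum-based estimator uses the blow-up factor $n/(n-1)$ exactly as quoted in the abstract, and the repeat-based estimator is the standard collision-count heuristic (the expected number of coinciding pairs is $n(n-1)/(2|S|)$), assumed here as a stand-in for "waiting for repeats".

```python
import random
from collections import Counter


def max_estimator(sample):
    """Estimate |S| when S = {1, ..., |S|}: sample maximum times n/(n-1).

    The factor n/(n-1) is the one quoted in the abstract.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    return max(sample) * n / (n - 1)


def collision_estimator(sample):
    """Estimate |S| from repeats only (no structure on S is used).

    Assumption: standard collision-count heuristic, not the paper's method.
    With C coinciding pairs among n draws, E[C] = n(n-1)/(2|S|),
    so |S| is estimated by n(n-1)/(2C). Returns None if no repeat occurred.
    """
    n = len(sample)
    counts = Counter(sample)
    pairs = sum(c * (c - 1) // 2 for c in counts.values())
    if pairs == 0:
        return None
    return n * (n - 1) / (2 * pairs)


def draw_sample(size, n, rng):
    """Draw n i.i.d. uniform points from {1, ..., size}."""
    return [rng.randint(1, size) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    true_size = 10_000
    for n in (20, 100, 1_000):
        sample = draw_sample(true_size, n, rng)
        print(n, max_estimator(sample), collision_estimator(sample))
```

With $|S| = 10{,}000$, the collision estimator typically returns `None` until $n$ is comparable to $|S|^{1/2} = 100$, whereas the maximum-based estimate is already informative for much smaller samples; this is the gap the paper's refinements are meant to interpolate.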

Paper Structure

This paper contains 25 sections, 20 theorems, 189 equations, 10 figures.

Key Result

Theorem 1.1

Let $(S, \mathcal{S})$ be a measurable space and let $X_1,\ldots,X_n$ be i.i.d. $S$-valued random variables with law $\mu$, where $n\ge 3$. Let $A:S^n \to 2^S$, $A':S^{n-1} \to 2^S$ and $A'':S^{n-2}\to 2^S$ be three symmetric set-valued maps that are measurable in the above sense. Define Then

Figures (10)

  • Figure 1: Convex hull of 100 randomly generated points uniformly distributed in the rectangle $[0,4] \times [0,2]$. The light blue shaded region represents the convex hull, while red points indicate the extreme points (vertices of the convex hull).
  • Figure 2: Verification of the MSE bound from Theorem \ref{convthm}: $\mathbb{E}[(V_n/n - D_{n})^2] \leq (6d+7)/n$. The empirical mean squared error is shown for four distributions (uniform rectangle and disk, independent Gaussian, and correlated Gaussian with $r=0.8$) in dimensions $d=2$ and $d=3$. The black dashed line represents the theoretical upper bound. All empirical values fall well below the bound across different probability measures.
  • Figure 3: Verification of the second bound from Theorem \ref{convthm}: $\mathbb{E}[|D_n - D_{n-1}|] \leq (d+1)/n$. The empirical expectation of the absolute difference between consecutive $D$ values is plotted against sample size for the same four distributions and dimensions. The black dashed line shows the theoretical bound $(d+1)/n$. The logarithmic scale emphasizes the $O(1/n)$ convergence rate predicted by the theory.
  • Figure 4: Illustrations of the volume estimator $\widehat{\mathrm{vol}(K)}$ from Corollary \ref{convcor}. The ratio $\mathbb{E}[\widehat{\mathrm{vol}(K)}/\mathrm{vol}(K)]$ approaches 1 (unbiased estimation) as sample size increases for unit cubes, unit balls, triangles, and tetrahedra in two and three dimensions. This convergence supports the validity of the volume estimation approach underlying the corollary's error bound.
  • Figure 5: Three examples of partially ordered sets. A line connecting two vertices indicates that they are comparable, with the vertex positioned higher in the picture being greater than the lower in the partial ordering.
  • ...and 5 more figures

Theorems & Definitions (42)

  • Theorem 1.1
  • Theorem 2.1
  • Lemma 2.2
  • proof
  • proof: Proof of Theorem \ref{unseencor2}
  • Lemma 2.3
  • proof
  • Theorem 2.4
  • proof
  • Corollary 2.5
  • ...and 32 more