Estimating the size of a set using cascading exclusion

Sourav Chatterjee; Persi Diaconis; Susan Holmes

Abstract

Let $S$ be a finite set, and $X_1,\ldots,X_n$ an i.i.d. uniform sample from $S$. To estimate the size $|S|$ without further structure, one can wait for repeats and use the birthday problem. This requires a sample size of order $|S|^{1/2}$. On the other hand, if $S=\{1,2,\ldots,|S|\}$, the maximum of the sample blown up by $n/(n-1)$ gives an efficient estimator based on any growing sample size. This paper gives refinements that interpolate between these extremes. A general non-asymptotic theory is developed. Applications include estimating the volume of a compact convex set, the unseen species problem, and a host of testing problems that follow from the question `Is this new observation a typical pick from a large prespecified population?' We also treat regression-style predictors. A general theorem gives non-parametric, finite-$n$ error bounds in all cases.
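To make the two extremes concrete, here is a minimal Python sketch of both baseline estimators. It is an illustration under stated assumptions, not the paper's cascading-exclusion method. The function names `birthday_estimate` and `max_estimate` are hypothetical. The birthday-style estimator uses a simple method-of-moments form chosen for this sketch: for i.i.d. uniform draws, the expected number of coinciding pairs is $\binom{n}{2}/|S|$. The maximum-based estimator applies the $n/(n-1)$ factor quoted in the abstract.

```python
import random
from collections import Counter
from math import comb


def birthday_estimate(sample):
    """Estimate |S| from repeats in an i.i.d. uniform sample.

    Assumption for this sketch: a method-of-moments estimator based on
    E[number of coinciding pairs] = C(n, 2) / |S|.
    Returns None when the sample contains no repeats.
    """
    n = len(sample)
    counts = Counter(sample)
    # Count unordered pairs (i, j), i < j, with sample[i] == sample[j].
    pairs = sum(comb(c, 2) for c in counts.values())
    if pairs == 0:
        return None
    return comb(n, 2) / pairs


def max_estimate(sample):
    """Estimate |S| when S = {1, ..., |S|}: sample maximum times n / (n - 1)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    return max(sample) * n / (n - 1)


if __name__ == "__main__":
    random.seed(0)
    size = 10_000  # true |S|, used only to generate the illustration
    # Repeats need roughly sqrt(|S|) draws or more before they appear.
    big = [random.randint(1, size) for _ in range(2_000)]
    # The maximum-based estimator works with a far smaller sample.
    small = [random.randint(1, size) for _ in range(50)]
    print("birthday estimate:", birthday_estimate(big))
    print("maximum estimate: ", max_estimate(small))
```

The birthday-style estimator uses only equality between observations, which is why it needs on the order of $|S|^{1/2}$ draws before it returns anything, while the maximum-based estimator exploits the ordering of $\{1,\ldots,|S|\}$.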
