Testing Support Size More Efficiently Than Learning Histograms

Renato Ferreira Pinto, Nathaniel Harms

TL;DR

This paper shows that testing whether an unknown distribution has support size at most $n$ can be done with significantly fewer samples than learning the full histogram. It introduces a Poissonized, Chebyshev-polynomial-based test statistic that extrapolates to unseen elements and uses a safe interval to control bias, achieving a sample complexity of $m(n,\varepsilon) = O\left( \frac{n}{\varepsilon \log n} \min\{ \log(1/\varepsilon), \log n\} \right)$, which is in particular at most $O\left( \frac{n}{\varepsilon \log n} \log(1/\varepsilon) \right)$. The method uses shifted, scaled Chebyshev polynomials to approximate the indicator of positive probabilities while bounding the contribution from tiny probabilities outside the safe interval, and includes a detailed optimization of parameters, variance control, and a correctness proof. The approach also yields improved lower bounds on the support size from a given set of samples and establishes an equivalence between testing support size for distributions and for Boolean functions, linking testing vs. learning questions across domains. Overall, the work closes part of the gap between the lower and upper bounds for testing support size and provides a self-contained exposition of the Chebyshev polynomial method in this context.

Abstract

Consider two problems about an unknown probability distribution $p$: 1. How many samples from $p$ are required to test if $p$ is supported on $n$ elements or not? Specifically, given samples from $p$, determine whether it is supported on at most $n$ elements, or it is "$\varepsilon$-far" (in total variation distance) from being supported on $n$ elements. 2. Given $m$ samples from $p$, what is the largest lower bound on its support size that we can produce? The best known upper bound for problem (1) uses a general algorithm for learning the histogram of the distribution $p$, which requires $\Theta(\tfrac{n}{\varepsilon^2 \log n})$ samples. We show that testing can be done more efficiently than learning the histogram, using only $O(\tfrac{n}{\varepsilon \log n} \log(1/\varepsilon))$ samples, nearly matching the best known lower bound of $\Omega(\tfrac{n}{\varepsilon \log n})$. This algorithm also provides a better solution to problem (2), producing larger lower bounds on support size than what follows from previous work. The proof relies on an analysis of Chebyshev polynomial approximations outside the range where they are designed to be good approximations, and the paper is intended as an accessible self-contained exposition of the Chebyshev polynomial method.

Paper Structure

This paper contains 24 sections, 40 theorems, 148 equations, 5 figures.

Key Result

Theorem 1.3

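For all $n \in \mathbb{N}$ and $\varepsilon \in (0,1)$, the sample complexity of testing support size of an unknown distribution $p$ (over any countable domain) is at most $m(n,\varepsilon) = O\left( \frac{n}{\varepsilon \log n} \min\{ \log(1/\varepsilon), \log n\} \right)$.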

Figures (5)

  • Figure 1: The polynomial $T_{11}(x)$ and the resulting $Q(p_i) = 1 + e^{-mp_i} P_d(p_i)$ for $d=11$ and certain choices of $\ell, r, n, m$. The 'safe' interval $[\ell, r]$ is between the two vertical lines in \ref{fig:cheb-example-b}. See that $Q(p_i)$ is an approximation of the 'idealized' function $Q(p_i) = \mathbb{1} \left[ p_i > 0 \right]$.
  • Figure 2: Untamed right tail.
  • Figure 3: $Q(p_i)$ for $p_i < \ell$, and the linear lower bound $Q(p_i) \geq (1-\delta)\tfrac{p_i}{\ell}$ in \ref{res:q-light-bounds}.
  • Figure 4: The function $\Phi(\lambda)$ with bad and good parameters.
  • Figure 5: Example values of $1+f(\bm{N_i})$ in the estimator.
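
The shape of $Q(p_i) = 1 + e^{-mp_i} P_d(p_i)$ from Figure 1 can be sketched numerically. The snippet below is a minimal illustration that assumes one standard normalization, $P_d(x) = -\delta\, T_d(\psi(x))$, where $\psi$ maps the safe interval $[\ell, r]$ affinely onto $[-1, 1]$ and $\delta = 1/T_d(\psi(0))$, so that $P_d(0) = -1$. The paper's exact definition of $P_d$ may differ, and the parameter values (`lo`, `hi`, `m`) and helper names are hypothetical.

```python
import math

def cheb_T(d: int, x: float) -> float:
    """Chebyshev polynomial of the first kind T_d(x), via the three-term recurrence."""
    t_prev, t_cur = 1.0, x
    if d == 0:
        return t_prev
    for _ in range(d - 1):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
    return t_cur

def make_Q(d: int, lo: float, hi: float, m: float):
    """Build Q(p) = 1 + exp(-m p) * P_d(p) under the assumed normalization.

    Assumption (not confirmed by the source): P_d(x) = -delta * T_d(psi(x)),
    where psi maps [lo, hi] onto [-1, 1] and delta = 1 / T_d(psi(0)).
    """
    def psi(x: float) -> float:
        # psi(lo) = 1, psi(hi) = -1, and psi(0) > 1 lies outside [-1, 1].
        return (hi + lo - 2.0 * x) / (hi - lo)

    delta = 1.0 / cheb_T(d, psi(0.0))

    def Q(p: float) -> float:
        return 1.0 - math.exp(-m * p) * delta * cheb_T(d, psi(p))

    return Q, delta

if __name__ == "__main__":
    # d = 11 matches Figure 1; the other values are hypothetical.
    Q, delta = make_Q(d=11, lo=1e-3, hi=0.05, m=100.0)
    for p in (0.0, 5e-4, 1e-3, 0.01, 0.05, 0.1):
        print(f"p = {p:<7} Q(p) = {Q(p):.4f}")
    print("delta =", delta)
```

Under this normalization $Q(0) = 0$ exactly, and $|Q(p) - 1| \le \delta$ on the safe interval because $|T_d| \le 1$ on $[-1, 1]$. To the right of $r$ the polynomial grows quickly, and the factor $e^{-mp}$ is what damps it there.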

Theorems & Definitions (96)

  • Example 1.1
  • Definition 1.2: Testing Support Size
  • Theorem 1.3
  • Definition 1.5: Effective support size
  • Corollary 1.6
  • Definition 1.7: Distribution-free sample-based testing; see formal \ref{def:testing-functions}
  • Theorem 1.8: Equivalence of testing support size for distributions and functions
  • Proposition 2.2
  • proof
  • Claim 2.3
  • ...and 86 more