Statistical-Computational Trade-offs for Density Estimation

Anders Aamand, Alexandr Andoni, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal, Haike Xu

TL;DR

For a broad class of data structures, a lower bound shows that the known sublinear bounds for density estimation cannot be significantly improved: any such data structure must use a close-to-linear number of samples or take close-to-linear query time.

Abstract

We study the density estimation problem defined as follows: given $k$ distributions $p_1, \ldots, p_k$ over a discrete domain $[n]$, as well as a collection of samples chosen from a "query" distribution $q$ over $[n]$, output $p_i$ that is "close" to $q$. Recently, \cite{aamand2023data} gave the first and only known result that achieves sublinear bounds in both the sampling complexity and the query time while preserving polynomial data structure space. However, their improvement over linear samples and time is only by subpolynomial factors. Our main result is a lower bound showing that, for a broad class of data structures, their bounds cannot be significantly improved. In particular, if an algorithm uses $O(n/\log^c k)$ samples for some constant $c>0$ and polynomial space, then the query time of the data structure must be at least $k^{1-O(1)/\log \log k}$, i.e., close to linear in the number of distributions $k$. This is a novel statistical-computational trade-off for density estimation, demonstrating that any data structure must use close to a linear number of samples or take close to linear query time. The lower bound holds even in the realizable case where $q=p_i$ for some $i$, and when the distributions are flat (specifically, all distributions are uniform over half of the domain $[n]$). We also give a simple data structure for our lower bound instance with asymptotically matching upper bounds. Experiments show that the data structure is quite efficient in practice.
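
To make the problem setup concrete, the following is a minimal sketch of the realizable, half-uniform instance described in the abstract, solved by a naive linear scan over all $k$ distributions. It is not the paper's data structure; it only illustrates the linear-query-time baseline that the trade-off is measured against. The function names (`make_half_uniform_instance`, `sample_query`, `linear_scan`) and all parameter values are hypothetical, chosen for illustration.

```python
import random


def make_half_uniform_instance(k, n, seed=0):
    """Return k supports, each a uniformly random half of the domain {0, ..., n-1}.

    Each distribution p_i is assumed to be uniform over its support.
    """
    rng = random.Random(seed)
    return [frozenset(rng.sample(range(n), n // 2)) for _ in range(k)]


def sample_query(support, num_samples, rng):
    """Draw i.i.d. samples from the uniform distribution over `support`."""
    items = sorted(support)  # fixed order so results depend only on the rng state
    return [rng.choice(items) for _ in range(num_samples)]


def linear_scan(supports, samples):
    """Return indices of all distributions whose support contains every sample.

    In the realizable case (q = p_i), the true index i always survives.
    The scan touches all k distributions, i.e. query time is linear in k.
    """
    sample_set = set(samples)
    return [i for i, supp in enumerate(supports) if sample_set <= supp]


if __name__ == "__main__":
    supports = make_half_uniform_instance(k=200, n=64, seed=1)
    rng = random.Random(2)
    true_index = 17  # hypothetical choice of the hidden distribution
    samples = sample_query(supports[true_index], num_samples=40, rng=rng)
    print(linear_scan(supports, samples))
```

Each sample lies in an unrelated random half-support with probability about $1/2$, so a wrong candidate survives $m$ distinct samples with probability roughly $2^{-m}$. The scan is therefore cheap in samples but costs time linear in $k$.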

Paper Structure

This paper contains 13 sections, 17 theorems, 31 equations, 2 figures, 1 table, 2 algorithms.

Key Result

Theorem 3.1

If a list-of-points data structure solves $\mathrm{URDE}\left(\frac{1}{2},s\right)$ using time $O(k^{\rho_q})$ and space $O(k^{1+\rho_u})$, and succeeds with probability at least $0.99$, then for sufficiently large $s$, $\rho_q\ge 1-\frac{1}{s^{1-\log 2-o(1)}}-\frac{\rho_u}{\log s-1}$.
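
As a rough numerical illustration of the bound in Theorem 3.1, the sketch below evaluates its right-hand side for a few values of $s$. Two assumptions are made here that the statement above does not settle: $\log$ is read as the natural logarithm, and the $o(1)$ term in the exponent is dropped. The output is therefore an approximation of the asymptotic expression, not the exact bound. The function name `rho_q_lower_bound` is hypothetical.

```python
import math


def rho_q_lower_bound(s, rho_u):
    """Approximate the right-hand side of Theorem 3.1.

    Assumptions (not confirmed by the source): log is the natural logarithm,
    and the o(1) term in the exponent is set to zero.
    """
    if s <= math.e:
        # The term rho_u / (log s - 1) needs log s > 1; the theorem
        # itself is only stated for sufficiently large s.
        raise ValueError("s must exceed e for this expression to be meaningful")
    exponent = 1.0 - math.log(2)
    return 1.0 - s ** (-exponent) - rho_u / (math.log(s) - 1.0)


if __name__ == "__main__":
    for s in (1e2, 1e3, 1e6, 1e12):
        print(f"s = {s:.0e}: rho_q >= {rho_q_lower_bound(s, rho_u=0.5):.4f}")
```

Under these assumptions the value increases toward $1$ as $s$ grows, with the $\rho_u/(\log s-1)$ term dominating the gap for large $s$.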

Figures (2)

  • Figure 1: Left: Trade-off between $1/s$ (samples as a fraction of $n$) and the query time exponent $\rho_q$ for our algorithm for half-uniform distributions (solid green curve), the algorithm of \cite{aamand2023data} for general distributions (dashed green curve), our analytic lower bound (solid red curve), and a numerical evaluation of the bound from \ref{thm:supermajority_lowerbound} (dashed black curve). We have fixed the space parameter $\rho_u=1/2$. The plots illustrate the asymptotic behaviour proven in \ref{thm:lower_bound_FNNS} and \ref{thm:main-upper} that as $s\to \infty$, $\rho_q=1-\Theta(1/\log s)$ both in the lower bound and for our algorithm for half-uniform distributions. Right: The same plot zoomed in to the upper left corner with $1/s$ on log-scale.
  • Figure 2: Comparison of efficiency of the Subset (Ours) and Elimination algorithms as (a): the number of distributions $k$ varies. Other parameters are set to $n=500, S=50, \ell=3$. (b): the domain size $n$ varies. Other parameters are set to $k=50000, S=50, \ell=3$. (c): the number of samples $S$ varies. Other parameters are set to $k=50000, n=500, \ell=3$. (d): the subset size $\ell$ varies. Other parameters are set to $k=50000, n=500, S=50$.

Theorems & Definitions (39)

  • Definition 2.1: Uniform random density estimation problem
  • Definition 2.2: Random $\mathrm{GapSS}$ problem \cite{ahle2020subsets}
  • Definition 2.3: List-of-points model
  • Theorem 3.1: Lower bound for $\mathrm{URDE}$
  • Theorem 3.2: Lower bound for random $\mathrm{GapSS}$ \cite{ahle2020subsets}
  • Theorem 3.3: Reduction from random $\mathrm{GapSS}$ to $\mathrm{URDE}$
  • proof
  • Theorem 3.4: Explicit lower bound for random $\mathrm{GapSS}$ instance
  • proof: Proof of Theorem \ref{thm:lower_bound_FNNS}
  • Remark 3.5
  • ...and 29 more