How fast can you find a good hypothesis?

Anders Aamand, Maryam Aliakbarpour, Justin Y. Chen, Sandeep Silwal

TL;DR

This work analyzes the hypothesis selection problem for a finite hypothesis class ${\mathcal H}$ with respect to an unknown distribution $P$, focusing on both proper and improper settings. It establishes tight information-theoretic limits: improper mixtures can achieve at best $C=3-2/n$ in expectation, while proper algorithms can reach $C=3$ with domain-independent samples; neither can beat these constants without substantial sample/complexity penalties. The authors introduce a suite of algorithmic techniques, notably semi-distances, Scheffé sets, and a semi-distance threshold framework, to achieve near-linear or subquadratic runtimes in various regimes. They provide fast algorithms in the moderate-probability regime with $C=3$ and sample complexity $O(\log(n/\delta)/\varepsilon^2)$, and extend to high-probability guarantees with preprocessing and known-OPT upper bounds, including subquadratic-time preprocessing strategies via geometric dimensionality reductions. The results significantly advance the computational efficiency of hypothesis selection, offering practical subquadratic and near-linear methods while preserving optimal or near-optimal statistical guarantees, and they clarify the limits of using mixtures for approximation improvements.

Abstract

In the hypothesis selection problem, we are given sample and query access to a finite set of candidate distributions (hypotheses), $\mathcal{H} = \{H_1, \ldots, H_n\}$, and samples from an unknown distribution $P$, both over a domain $\mathcal{X}$. The goal is to output a distribution $Q$ whose distance to $P$ is comparable to that of the nearest hypothesis in $\mathcal{H}$. Specifically, if the minimum distance is $\mathsf{OPT}$, we aim to output $Q$ such that, with probability at least $1-\delta$, its total variation distance to $P$ is at most $C \cdot \mathsf{OPT} + \varepsilon$. The optimal approximation for proper algorithms (where $Q \in \mathcal{H}$) is $C=3$ using $\Theta(\log(n/\delta)/\varepsilon^2)$ samples from $P$, and for improper algorithms (where $Q$ is not necessarily in $\mathcal{H}$) it is $C=2$ using $\tilde{\Theta}(\log(n/\delta)/\varepsilon^2)$ samples from $P$. In the improper setting, the algorithm achieving $C=2$ [Bousquet, Braverman, Kol, Efremenko, Moran, FOCS 2021] runs in time that grows polynomially with $|\mathcal{X}|$, so it does not run in finite time for real-valued distributions. A promising path towards improved runtime is to consider improper algorithms that output a mixture $Q$ of the hypotheses, since such a distribution can be represented in $n$ words of memory. We show (1) a lower bound that no algorithm which outputs a mixture can achieve approximation better than $C = 3-2/n$ unless the number of samples is polynomial in $|\mathcal{X}|$, as well as (2) an algorithm which runs in time $\text{poly}(n)$ and achieves the same approximation guarantee. In the proper setting, [Aliakbarpour, Bun, Smith, NeurIPS 2024] provided an algorithm with $C=3$ running in $\tilde{O}(n/(\delta^3\varepsilon^3))$ time. We improve this time complexity to $\tilde{O}(n/(\delta\varepsilon^2))$, significantly reducing the dependence on the confidence and error parameters.
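To make the proper setting concrete, the following is a minimal sketch of the classical Scheffé-set minimum distance estimate over a finite domain, which is known to give $C=3$ but takes time quadratic in $n$. It is not the fast algorithm of this paper. The function name `min_distance_estimate`, the representation of hypotheses as rows of a probability matrix, and the use of the empirical distribution in place of $P$ are all assumptions made for illustration.

```python
import numpy as np


def min_distance_estimate(hypotheses, samples):
    """Quadratic-time proper hypothesis selection (illustrative sketch).

    hypotheses: (n, m) array; row i is the pmf of H_i over {0, ..., m-1}.
    samples:    1-D integer array of draws from the unknown distribution P.
    Returns (index of the selected hypothesis, array of scores W).
    """
    H = np.asarray(hypotheses, dtype=float)
    n, m = H.shape
    # Empirical distribution of the samples, used as a stand-in for P.
    emp = np.bincount(samples, minlength=m) / len(samples)

    W = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Scheffe set of (H_i, H_j): points where H_i puts more mass.
            S = H[i] > H[j]
            # Semi-distance: gap between H_i and the samples on that set.
            w = abs(H[i][S].sum() - emp[S].sum())
            W[i] = max(W[i], w)
    # Select the hypothesis whose worst-case semi-distance is smallest.
    return int(np.argmin(W)), W
```

Each score `W[i]` never exceeds the total variation distance between $H_i$ and the empirical distribution, which is why comparing only on Scheffé sets is enough for a constant-factor guarantee. The double loop is the $\Theta(n^2)$ cost that near-linear-time algorithms aim to avoid.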

Paper Structure

This paper contains 48 sections, 37 theorems, 116 equations, 3 figures, 1 table, 6 algorithms.

Key Result

Theorem 1.1

Consider a sample size $s$ which is a function of $n, \varepsilon, \delta$. There exists a sufficiently large, finite domain size $|\mathcal{X}|$ such that no randomized algorithm with $s$ samples can output a convex combination of the hypotheses with expected approximation factor less than $3 - \fr

Figures (3)

  • Figure 1: Visualization of hard instance with hypotheses $H_i$, $H_{i'}$ and distribution $P_i \in \text{support}(\mathcal{S}_i)$ with $i=2$ and $i'=1$. The domain $\mathcal{X}$ is partitioned into intervals $T_1, \ldots, T_{2n}$, each subdivided into $k$ sub-intervals $T_i^j$ of length $\ell$. Hypotheses $H_i$ and $H_{i'}$ are mostly uniform but assign slightly higher mass $(1+\beta)/|\mathcal{X}|$ to $T_{2i-1}$ or $T_{2i'-1}$ and slightly lower mass $(1-\beta)/|\mathcal{X}|$ to $T_{2i}$ or $T_{2i'}$ respectively. $P_i$ mostly agrees with $H_i$ but exhibits fine-grained structure within sub-intervals: on each $T_{2i-1}^j$, one element is chosen at random to be a sink with 0 mass; on each $T_{2i}^j$, one element is chosen at random to be a spike with mass $2/|\mathcal{X}|$.
  • Figure 2: A node $i$ in $[n]$ is $\alpha$-prompting if it has edges to at least an $\alpha$-fraction of the nodes of $V$. (a) In the case where nodes in $[n]$ are $\gg\beta$-prompting on average, a $\beta$-prompting $i$ can be found through a sampling procedure which samples nodes of $V$. (b) When the nodes in $[n]$ are only $O(\beta)$-prompting on average, we can instead sample a set $S$ of size $O(\log n)$ from $V$ (blue), identify, for each $j\in S$, the set $T_j$ of nodes in $[n]$ with an edge to $j$, and test how prompting the nodes in $T_j$ are. If we find a $\beta$-prompting hypothesis, we return it. At a high level (the concrete argument is more finicky), if $i^*$ is $\beta$-prompting and has an edge to a node in $S$, we will return some $\beta$-prompting hypothesis (not necessarily $i^*$). On the other hand, if $i^*$ is not $\beta$-prompting, the probability that $i^*$ has an edge to a node of the sampled set $S$ is $O(\beta \log n)$, so in this case, if we don't find a $\beta$-prompting hypothesis, we can instead return an arbitrary node $j$ from $S$, and with probability $1-O(\beta \log n)$ there is no edge from $i^*$ to $j$.
  • Figure 3: The rounding procedure takes values $\{p_i\}_{i=1}^n$ and outputs a valid distribution $\{q_i\}_{i=1}^n$.

Theorems & Definitions (76)

  • Theorem 1.1: Informal version of \ref{cor:proper-lb} and \ref{cor:exp-lower-bound}
  • Theorem 1.2
  • Theorem 1.3
  • Theorem 1.4
  • Theorem 1.5
  • Definition 2.1: Scheffé set [devroye2001combinatorial]
  • Definition 2.2: Semi-distances [devroye2001combinatorial]
  • Proposition 2.3: Underestimation
  • Proposition 2.4: Semi-Distance TV Approximation
  • Proposition 2.5: Approximating Semi-Distances via Samples
  • ...and 66 more