How fast can you find a good hypothesis?
Anders Aamand, Maryam Aliakbarpour, Justin Y. Chen, Sandeep Silwal
TL;DR
This work analyzes the hypothesis selection problem for a finite hypothesis class ${\mathcal H}$ with respect to an unknown distribution $P$, in both the proper and improper settings. It establishes information-theoretic limits: improper algorithms that output a mixture of the hypotheses can achieve at best $C = 3 - 2/n$ in expectation unless the number of samples grows polynomially with the domain size $|\mathcal{X}|$, while proper algorithms reach the optimal $C = 3$ with a number of samples independent of the domain size. The authors introduce a suite of algorithmic techniques, notably semi-distances, Scheffé sets, and a semi-distance threshold framework, to achieve near-linear or subquadratic runtimes in various regimes. They provide fast algorithms in the moderate-probability regime with $C=3$ and sample complexity $O(\log(n/\delta)/\varepsilon^2)$, and extend to high-probability guarantees with preprocessing and known-OPT upper bounds, including subquadratic-time preprocessing strategies via geometric dimensionality reductions. The results improve the running time of proper hypothesis selection with $C=3$ from $\tilde{O}(n/(\delta^3\varepsilon^3))$ to $\tilde{O}(n/(\delta\varepsilon^2))$ while preserving the optimal approximation factor, and they clarify the limits of using mixtures to improve the approximation factor.
Abstract
In the hypothesis selection problem, we are given sample and query access to a finite set of candidate distributions (hypotheses), $\mathcal{H} = \{H_1, \ldots, H_n\}$, and samples from an unknown distribution $P$, both over a domain $\mathcal{X}$. The goal is to output a distribution $Q$ whose distance to $P$ is comparable to that of the nearest hypothesis in $\mathcal{H}$. Specifically, if the minimum distance is $\mathsf{OPT}$, we aim to output $Q$ such that, with probability at least $1-\delta$, its total variation distance to $P$ is at most $C \cdot \mathsf{OPT} + \varepsilon$. The optimal approximation factor for proper algorithms (where $Q \in \mathcal{H}$) is $C=3$ using $\Theta(\log(n/\delta)/\varepsilon^2)$ samples from $P$, and for improper algorithms (where $Q$ is not necessarily in $\mathcal{H}$) it is $C=2$ using $\tilde{\Theta}(\log(n/\delta)/\varepsilon^2)$ samples from $P$. In the improper setting, the algorithm achieving $C=2$ [Bousquet, Braverman, Kol, Efremenko, Moran, FOCS 2021] runs in time that grows polynomially with $|\mathcal{X}|$, so it does not run in finite time for real-valued distributions. A promising path towards improved runtime is to consider improper algorithms that output a mixture $Q$ of the hypotheses, as such a distribution can be represented in $n$ words of memory. We show (1) a lower bound: no algorithm that outputs a mixture can achieve an approximation factor better than $C = 3-2/n$ unless the number of samples is polynomial in $|\mathcal{X}|$, and (2) an algorithm that runs in time $\text{poly}(n)$ and achieves the same approximation guarantee. In the proper setting, [Aliakbarpour, Bun, Smith, NeurIPS 2024] provided an algorithm with $C=3$ running in $\tilde{O}(n/(\delta^3\varepsilon^3))$ time. We improve this time complexity to $\tilde{O}(n/(\delta\varepsilon^2))$, significantly reducing the dependence on the confidence and error parameters.
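To make the problem statement concrete, the following is a minimal sketch of a proper selector on a small finite domain. It is not the algorithm of this paper: it is the classical minimum distance estimate over Scheffé sets, which is known to give $C=3$ but is far slower than near-linear in $n$. The function names `min_distance_select` and `tv`, and the representation of hypotheses as rows of probability mass functions over $\{0, \ldots, k-1\}$, are assumptions made for illustration.

```python
import numpy as np

def min_distance_select(hyps, samples):
    """Return the index of a hypothesis chosen by the minimum distance estimate.

    hyps: (n, k) array-like, each row a pmf over the domain {0, ..., k-1}.
    samples: 1-D integer array of draws from the unknown distribution P.
    """
    hyps = np.asarray(hyps, dtype=float)
    n, k = hyps.shape
    # Empirical distribution of the samples over the domain.
    p_hat = np.bincount(samples, minlength=k) / len(samples)
    scores = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Scheffe set A_ij = {x : H_i(x) > H_j(x)}, stored as a mask.
            mask = hyps[i] > hyps[j]
            # Gap between each hypothesis's mass on A_ij and the empirical mass.
            gaps = np.abs(hyps[:, mask].sum(axis=1) - p_hat[mask].sum())
            # Each hypothesis is scored by its worst gap over all Scheffe sets.
            scores = np.maximum(scores, gaps)
    return int(np.argmin(scores))

def tv(p, q):
    """Total variation distance between two pmfs on the same finite domain."""
    return 0.5 * np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)).sum()

# Illustrative data (hypothetical): three hypotheses over a 3-element domain
# and a sample whose empirical distribution is exactly (0.65, 0.25, 0.10).
hyps = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [1 / 3, 1 / 3, 1 / 3]]
samples = np.repeat([0, 1, 2], [650, 250, 100])
chosen = min_distance_select(hyps, samples)
print(chosen, tv(hyps[chosen], [0.65, 0.25, 0.10]))
```

The triple loop over hypotheses and pairs of Scheffé sets is what makes this baseline expensive; the running times quoted in the abstract concern avoiding exactly this kind of exhaustive comparison.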
