Nearly-Linear Time Private Hypothesis Selection with the Optimal Approximation Factor

Maryam Aliakbarpour, Zhan Shi, Ria Stevens, Vincent X. Wang

TL;DR

This work develops a central-model differentially private algorithm for hypothesis selection that achieves the optimal approximation factor $\alpha=3$ while running in nearly-linear time in the number of candidate hypotheses $n$. The method builds on a minimum-distance-estimate framework, using empirical semi-distances, lifting, and prompting concepts to iteratively refine a small prompting set $A$ and privately identify useful hypotheses via the Exponential Mechanism and Sparse Vector Technique. The authors prove both privacy (via composition and SVT/Exponential Mechanism guarantees) and correctness (showing that, with high probability, the output satisfies $\|\hat{H}-P\|_{TV} \le 3\cdot{\rm OPT}+\sigma$) with sample complexity $s = \Theta\left(\frac{\log^3(n/\beta)}{\beta^2 \sigma^2 \epsilon}\right)$ and runtime $\tilde{\Theta}\left(\frac{n}{\beta^4 \sigma^3 \epsilon}\right)$. This resolves an open question on the existence of nearly-linear-time private hypothesis selection with the optimal approximation factor, achieving a favorable privacy-accuracy-time trade-off relative to prior quadratic-time results. The framework holds potential for scalable private distribution estimation and agnostic learning in high-dimensional settings where interpretability via a finite hypothesis class is desirable.
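As a point of reference for the private selection step mentioned above, here is a minimal sketch of the standard Exponential Mechanism, which samples an index with probability proportional to $\exp(\epsilon \cdot u_i / (2\Delta))$. The function name, the `scores` input, and the `sensitivity` parameter are illustrative assumptions; the score function the authors actually use is not given in this summary, so this is not their implementation.

```python
import math
import random


def exponential_mechanism(scores, epsilon, sensitivity, rng=None):
    """Return an index sampled with probability proportional to
    exp(epsilon * score / (2 * sensitivity)).

    `scores` is a hypothetical list of utilities, one per candidate;
    `sensitivity` is the assumed sensitivity of the score function.
    """
    rng = rng or random.Random()
    # Shifting by the maximum score avoids overflow and leaves the
    # sampling distribution unchanged.
    top = max(scores)
    weights = [
        math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores
    ]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]
```

With a very large `epsilon` the mechanism concentrates on the highest-scoring candidate, and with `epsilon = 0` it samples uniformly, which matches the usual privacy-utility trade-off.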

Abstract

Estimating the density of a distribution from its samples is a fundamental problem in statistics. Hypothesis selection addresses the setting where, in addition to a sample set, we are given $n$ candidate distributions -- referred to as hypotheses -- and the goal is to determine which one best describes the underlying data distribution. This problem is known to be solvable very efficiently, requiring roughly $O(\log n)$ samples and running in $\tilde{O}(n)$ time. The quality of the output is measured via the total variation distance to the unknown distribution, and the approximation factor of the algorithm determines how large this distance is compared to the optimal distance achieved by the best candidate hypothesis. It is known that $\alpha = 3$ is the optimal approximation factor for this problem. We study hypothesis selection under the constraint of differential privacy. We propose a differentially private algorithm in the central model that runs in nearly-linear time with respect to the number of hypotheses, achieves the optimal approximation factor, and incurs only a modest increase in sample complexity, which remains polylogarithmic in $n$. This resolves an open question posed by [Bun, Kamath, Steinke, Wu, NeurIPS 2019]. Prior to our work, existing upper bounds required quadratic time.
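To make the problem setup concrete, the following is a minimal sketch of a naive, non-private, quadratic-time selection rule in the minimum-distance style, for discrete distributions over a small finite domain. It is a baseline illustration only, not the paper's nearly-linear-time private algorithm. The function name `min_distance_select` and the representation of hypotheses as probability lists over `{0, ..., k-1}` are assumptions made for this example.

```python
from collections import Counter


def min_distance_select(hypotheses, samples):
    """Pick the hypothesis whose mass on pairwise Scheffe sets is closest
    to the empirical mass, in the worst case over competitors.

    hypotheses: list of probability vectors over the domain {0, ..., k-1}
    samples:    list of observed domain elements
    """
    k = len(hypotheses[0])
    counts = Counter(samples)
    empirical = [counts.get(x, 0) / len(samples) for x in range(k)]

    best_index, best_score = None, float("inf")
    for i, h_i in enumerate(hypotheses):
        worst_gap = 0.0
        for j, h_j in enumerate(hypotheses):
            if i == j:
                continue
            # Scheffe set: points where h_i puts more mass than h_j.
            scheffe = [x for x in range(k) if h_i[x] > h_j[x]]
            gap = abs(
                sum(h_i[x] for x in scheffe)
                - sum(empirical[x] for x in scheffe)
            )
            worst_gap = max(worst_gap, gap)
        if worst_gap < best_score:
            best_index, best_score = i, worst_gap
    return best_index
```

The double loop over hypothesis pairs is what makes this baseline quadratic in $n$; the abstract's contribution is avoiding that cost while also satisfying differential privacy.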

Paper Structure

This paper contains 53 sections, 14 theorems, 81 equations, 1 table, 3 algorithms.

Key Result

Theorem 2

For every $\epsilon, \beta, \sigma \in (0,1)$, Algorithm alg::wrapper is an $(\alpha = 3, \epsilon, \beta, \sigma)$-proper learner for the private hypothesis selection problem that uses $s = \Theta\left(\frac{\log^3(n/\beta)}{\beta^2 \sigma^2 \epsilon}\right)$ samples and runs in time $\tilde{\Theta}\left(\frac{n}{\beta^4 \sigma^3 \epsilon}\right)$.

Theorems & Definitions (32)

  • Definition 1.1: Proper learner for private hypothesis selection
  • Remark 1
  • Theorem 2: Informal version of Theorem \ref{thm:meta}
  • Lemma 3.1
  • proof
  • Definition 3.2: Pure differential privacy
  • Definition 3.3: Sensitivity
  • Definition 3.4: Exponential mechanism [McSherry and Talwar, 2007; Dwork and Roth, 2014]
  • Theorem 3
  • Lemma 5.1
  • ...and 22 more