Near-optimal algorithms for private estimation and sequential testing of collision probability

Robert Busa-Fekete, Umar Syed

TL;DR

This paper studies private estimation and sequential testing of the collision probability $C(\mathbf{p})=\sum_i p_i^2$ of a discrete distribution $\mathbf{p}$. It introduces a non-interactive, $(\alpha, \beta)$-locally differentially private estimator that uses salted hashing to count hash collisions across $\Theta(n^2)$ pairs of samples. The estimator reaches error at most $\varepsilon$ with near-optimal sample complexity $\tilde{O}\left(\frac{\log(1/\beta)}{\alpha^2 \varepsilon^2}\right)$ for $\alpha\le 1$, improving on previous bounds by a factor of $1/\alpha^2$. The paper also develops a sequential testing algorithm that distinguishes $C(\mathbf{p})=c_0$ from $|C(\mathbf{p})-c_0|\ge\varepsilon$ with $\tilde{O}\left(\frac{1}{\varepsilon^2}\right)$ samples, adapting automatically to unknown $\varepsilon$; a private sequential tester variant (PSQ) combines these ideas under privacy constraints. The work provides matching lower bounds (up to logarithmic factors), and its experiments show that the algorithms need significantly fewer samples than prior methods. Overall, the approaches directly exploit the $\Theta(n^2)$ potential collisions among $n$ samples to improve estimation and testing efficiency under local privacy.
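To make the quantity concrete, the following is a minimal, non-private sketch of estimating $C(\mathbf{p})$ by counting collisions among all $n(n-1)/2$ unordered sample pairs (the U-statistic estimator), next to the plug-in estimator that squares empirical frequencies. It is not the paper's private mechanism: there is no hashing, salting, or noise here, and the function names are hypothetical, chosen only for illustration.

```python
from collections import Counter


def collision_probability(p):
    """Exact C(p) = sum_i p_i^2 for a known probability vector p."""
    return sum(q * q for q in p)


def pairwise_collision_estimate(samples):
    """U-statistic: fraction of the n(n-1)/2 unordered pairs that collide.

    Illustrative, non-private sketch (hypothetical helper name).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    counts = Counter(samples)
    # A value seen c times contributes c*(c-1)/2 colliding pairs.
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    total_pairs = n * (n - 1) // 2
    return colliding_pairs / total_pairs


def plug_in_estimate(samples):
    """Plug-in estimator: sum of squared empirical frequencies."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    return sum((c / n) ** 2 for c in counts.values())
```

Counting through per-value frequencies gives the same result as enumerating every pair, but in time linear in $n$ rather than quadratic. The plug-in estimator also counts each sample as colliding with itself, which is why it is biased upward on small samples, whereas the pairwise estimator is unbiased.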

Abstract

We present new algorithms for estimating and testing collision probability, a fundamental measure of the spread of a discrete distribution that is widely used in many scientific fields. We describe an algorithm that satisfies $(\alpha, \beta)$-local differential privacy and estimates collision probability with error at most $\varepsilon$ using $\tilde{O}\left(\frac{\log(1/\beta)}{\alpha^2 \varepsilon^2}\right)$ samples for $\alpha\le 1$, which improves over previous work by a factor of $\frac{1}{\alpha^2}$. We also present a sequential testing algorithm for collision probability, which can distinguish between collision probability values that are separated by $\varepsilon$ using $\tilde{O}\left(\frac{1}{\varepsilon^2}\right)$ samples, even when $\varepsilon$ is unknown. Our algorithms have nearly the optimal sample complexity, and in experiments we show that they require significantly fewer samples than previous methods.

Paper Structure

This paper contains 32 sections, 20 theorems, 97 equations, 5 figures, 2 algorithms.

Key Result

Theorem 1

Mechanism \ref{alg:second} satisfies $(\alpha, \beta)$-local differential privacy.

Figures (5)

  • Figure 1: Sample complexity of private collision probability estimation mechanisms for $\alpha = 0.25$. Both mechanisms use the MD5 hash function and confidence level $\delta = 0.1$. For Mechanism 1 we let $\beta = 10^{-5}$. Error bars are one standard error.
  • Figure 2: Sample complexity of our sequential tester (Algorithm \ref{alg:closeness_simple}) compared to the sample complexity of das_2017's sequential tester adapted for collision probability testing.
  • Figure 3: Sample complexity of the sequential tester compared to the sample complexity of the batch testers. For the batch testers, the tolerance parameter $\epsilon$ is set to $0.01$.
  • Figure 4: Empirical absolute error of plug-in and U-statistic estimators when the data is generated from uniform distribution and power law with domain size 1000.
  • Figure 5: Sample complexity of private sequential testing algorithms with respect to the non-private estimator. The sample complexity of private sequential testers is divided by the sample complexity of the non-private sequential tester from Figure \ref{fig:samp_batch} and the multiplicative factors are shown.

Theorems & Definitions (27)

  • Theorem 1
  • Theorem 2
  • Corollary 1
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Theorem 8
  • Proposition 1
  • ...and 17 more