Near-optimal algorithms for private estimation and sequential testing of collision probability
Robert Busa-Fekete, Umar Syed
TL;DR
This paper studies private estimation and sequential testing of the collision probability $C(\mathbf{p})=\sum_i p_i^2$ of a discrete distribution $\mathbf{p}$. It introduces a non-interactive, $(\alpha,\beta)$-locally differentially private estimator that uses salted hashing to count hash collisions across $\Theta(n^2)$ pairs of samples. The estimator reaches error at most $\varepsilon$ with $\tilde{O}\left(\frac{\log(1/\beta)}{\alpha^2 \varepsilon^2}\right)$ samples for $\alpha\le 1$, which is near-optimal and improves on previous bounds by a factor of $1/\alpha^2$. The paper also develops a sequential testing algorithm that distinguishes $C(\mathbf{p})=c_0$ from $|C(\mathbf{p})-c_0|\ge\varepsilon$ using $\tilde{O}\left(\frac{1}{\varepsilon^2}\right)$ samples and adapts automatically to an unknown $\varepsilon$; a private variant of the sequential tester (PSQ) combines these ideas under privacy constraints. The work provides lower bounds that match up to logarithmic factors, and its experiments show that the algorithms need significantly fewer samples than prior methods. Both approaches exploit the $\Theta(n^2)$ potential collisions among $n$ samples to improve estimation and testing efficiency, including under local privacy.
Abstract
We present new algorithms for estimating and testing \emph{collision probability}, a fundamental measure of the spread of a discrete distribution that is widely used in many scientific fields. We describe an algorithm that satisfies $(\alpha, \beta)$-local differential privacy and estimates collision probability with error at most $\varepsilon$ using $\tilde{O}\left(\frac{\log(1/\beta)}{\alpha^2 \varepsilon^2}\right)$ samples for $\alpha\le 1$, which improves over previous work by a factor of $\frac{1}{\alpha^2}$. We also present a sequential testing algorithm for collision probability, which can distinguish between collision probability values that are separated by $\varepsilon$ using $\tilde{O}\left(\frac{1}{\varepsilon^2}\right)$ samples, even when $\varepsilon$ is unknown. Our algorithms have nearly the optimal sample complexity, and in experiments we show that they require significantly fewer samples than previous methods.
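
To make the quantity concrete, the following is a minimal, non-private sketch of how collision probability can be estimated by counting colliding pairs among all $\binom{n}{2}$ pairs of samples. It is not the paper's locally private mechanism (no salted hashing, no noise), only the standard unbiased pair-counting estimator; the function names `collision_probability` and `estimate_collision_probability` are hypothetical and chosen for illustration.

```python
from collections import Counter


def collision_probability(p):
    """Exact C(p) = sum_i p_i^2 for a known distribution p."""
    return sum(q * q for q in p)


def estimate_collision_probability(samples):
    """Unbiased pair-counting estimate of C(p) from i.i.d. samples.

    Illustrative and NOT private: it returns the fraction of the
    n*(n-1)/2 unordered sample pairs whose two values are equal.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    counts = Counter(samples)
    # A value seen c times contributes c*(c-1)/2 colliding pairs.
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    total_pairs = n * (n - 1) // 2
    return colliding_pairs / total_pairs
```

For example, among the samples `["a", "a", "b", "c"]` exactly one of the six unordered pairs collides, so the estimate is $1/6$, while the exact value for the distribution $(0.5, 0.25, 0.25)$ is $0.375$.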
