
Locally Private Histograms in All Privacy Regimes

Clément L. Canonne, Abigail Gentle

TL;DR

A protocol for histograms in the \emph{shuffle} model of differential privacy is obtained, with accuracy matching previous algorithms but significantly better message and communication complexity.

Abstract

Frequency estimation, a.k.a. histograms, is a workhorse of data analysis, and as such has been thoroughly studied under differential privacy. In particular, computing histograms in the \emph{local} model of privacy has been the focus of a fruitful recent line of work, and various algorithms have been proposed, achieving the order-optimal $\ell_\infty$ error in the high-privacy (small $\varepsilon$) regime while balancing other considerations such as time- and communication-efficiency. However, to the best of our knowledge, the picture is much less clear when it comes to the medium- or low-privacy regime (large $\varepsilon$), despite its increased relevance in practice. In this paper, we investigate locally private histograms, and the closely related distribution learning task, in this medium-to-low privacy regime, and establish near-tight (and somewhat unexpected) bounds on the $\ell_\infty$ error achievable. As a direct corollary of our results, we obtain a protocol for histograms in the \emph{shuffle} model of differential privacy, with accuracy matching previous algorithms but significantly better message and communication complexity. Our theoretical findings emerge from a novel analysis, which appears to improve bounds across the board for the locally private histogram problem. We support our theoretical findings with an empirical comparison of existing algorithms in all privacy regimes, to assess their typical performance and behaviour beyond the worst-case setting.

Paper Structure

This paper contains 32 sections, 24 theorems, 87 equations, 4 figures, 1 table.

Key Result

Proposition 1

Let $A$ be any locally private protocol for frequency estimation with expected $\ell_\infty$ error $O\left(\sqrt{\log k/(n\varepsilon^2)}\right)$ for $\varepsilon \leq 1$, using $\ell$ bits of communication per user. Then there is a locally private protocol $A'$ achieving e[…] for all $\varepsilon > 0$, using $\ell\lceil\varepsilon\rceil$ bits of o[…]

Figures (4)

  • Figure 1: Maximum error ($\ell_\infty$) in $x\%$ of 1000 runs, with $\varepsilon=5$, $k=5000$, $n=2000$; the distribution is a point mass at the first index. Horizontal lines indicate upper and lower bounds on the expected maximum explored in this paper. The improved bound for RAPPOR is the one obtained by using the exact sub-Gaussian parameter.
  • Figure 2: $\ell_\infty$ error in $x\%$ of runs, with 1000 repeats per protocol. Distributions are $\operatorname{Zipf}(\alpha)$, where $p_i\propto i^{-\alpha}$. Larger values of $\alpha$ give more concentrated distributions: $\alpha=2000$ has its entire mass on a single point, while $\alpha=0$ is the uniform distribution. All experiments use $\varepsilon=5$, $k=500$, $n=1000$.
  • Figure 3: Log-log plot of median $\ell_\infty$ error with upper and lower quartiles, as a function of $\varepsilon$. Lower and upper bounds discussed in this work are included for comparison. No normalisation or clipping has been applied, leading to $\ell_\infty>2$ in the high-privacy regime.
  • Figure 4: Log-log plot of some bounds discussed in this paper, demonstrating a transition to the local Glivenko–Cantelli bound in the intermediate privacy regime. This privacy regime was too computationally expensive to simulate; however, larger values of $k$ seem to be of interest, as the shuffle model allows for larger values of $\varepsilon$ to compensate for the logarithmic error loss in the alphabet size. In addition, some literature suggests that domain reduction tools may introduce an impractical amount of noise [erlingsson2020].
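
The experimental setup described in the captions can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes the standard one-shot (symmetric) RAPPOR randomiser, in which each user one-hot encodes their item and flips every bit independently with probability $1/(e^{\varepsilon/2}+1)$, and it measures $\ell_\infty$ error against the empirical frequencies of the sampled data. The function names (`zipf_pmf`, `rappor_estimate`, `run_trial`) are hypothetical, and only the Python standard library is used.

```python
import math
import random


def zipf_pmf(k, alpha):
    """Return p with p_i proportional to i^(-alpha) for i = 1..k."""
    weights = [float(i) ** (-alpha) for i in range(1, k + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def rappor_estimate(samples, k, eps, rng):
    """Assumed symmetric RAPPOR: flip each one-hot bit with probability q,
    then debias the per-coordinate means to get frequency estimates."""
    q = 1.0 / (math.exp(eps / 2.0) + 1.0)  # per-bit flip probability
    counts = [0] * k
    for x in samples:
        for j in range(k):
            bit = 1 if j == x else 0
            if rng.random() < q:
                bit = 1 - bit
            counts[j] += bit
    n = len(samples)
    # E[bit_j] = q + f_j * (1 - 2q), so invert this affine map.
    return [(c / n - q) / (1.0 - 2.0 * q) for c in counts]


def linf_error(estimate, truth):
    """Maximum absolute coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(estimate, truth))


def run_trial(k, n, eps, alpha, seed):
    """Sample n items from Zipf(alpha) on k symbols, privatise them, and
    return the l_infinity error against the empirical frequencies."""
    rng = random.Random(seed)
    p = zipf_pmf(k, alpha)
    samples = rng.choices(range(k), weights=p, k=n)
    freq = [0.0] * k
    for x in samples:
        freq[x] += 1.0 / n
    est = rappor_estimate(samples, k, eps, rng)
    return linf_error(est, freq)


if __name__ == "__main__":
    # Smaller than the k = 500, n = 1000 of Figure 2, to keep the run short.
    errors = sorted(run_trial(50, 500, 5.0, 1.0, seed) for seed in range(20))
    print("median l_inf error:", errors[len(errors) // 2])
```

As in Figure 3, the estimates are neither clipped nor normalised, so individual coordinates may fall outside $[0,1]$. Repeating `run_trial` over many seeds and sorting the resulting errors gives the kind of per-percentile curves the captions describe.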

Theorems & Definitions (40)

  • Proposition 1: Informal; see \ref{prop:generic:transformation}
  • Theorem 1: Informal; see \ref{theo:rappor:improved}
  • Theorem 2: Informal; see \ref{theo:optimal:rappor:ub}
  • Theorem 3: Informal; see \ref{theo:optimal:pgr:ub}
  • Theorem 4: Informal; see \ref{thm:expected-max-lb}
  • Theorem 5: Informal; see \ref{theo:pgr:shuffle}
  • Corollary 1
  • Definition 1: Locally private randomiser
  • Proposition 2
  • Corollary 2
  • ...and 30 more