When Focus Enhances Utility: Target Range LDP Frequency Estimation and Unknown Item Discovery

Bo Jiang, Wanrong Zhang, Donghang Lu, Jian Du, Qiang Yan

TL;DR

This paper addresses the challenge of accurate frequency estimation under Local Differential Privacy (LDP) and extends it to data domains that are unknown a priori. It introduces Generalized Count Mean Sketch (GCMS) and Optimal CMS (OCMS) within an Encryption-Shuffling-Analysis (ESA) framework, achieving improved communication efficiency and higher utility under LDP while enabling targeted frequency estimation. For unknown domains, it proposes a stability-based histogram protocol augmented with an auxiliary server that does not access raw messages, delivering accuracy close to central DP with local-like privacy and lower computation. Theoretical analyses and extensive experiments on real datasets demonstrate GCMS's superior utility compared to CMS, OCMS's targeted-frequency variance minimization, and practical privacy-utility gains for unknown-domain data collection.

Abstract

Local Differential Privacy (LDP) protocols enable the collection of randomized client messages for data analysis without requiring a trusted data curator. Such protocols have been successfully deployed in real-world scenarios by major tech companies like Google, Apple, and Microsoft. In this paper, we propose a Generalized Count Mean Sketch (GCMS) protocol that captures many existing frequency estimation protocols. Our method significantly improves the three-way trade-off between communication, privacy, and accuracy. We also introduce a general utility analysis framework that enables optimizing parameter designs. Based on that, we propose an Optimal Count Mean Sketch (OCMS) framework that minimizes the variance for collecting items with targeted frequencies. Moreover, we present a novel protocol for collecting data from an unknown domain, as our frequency estimation protocols only work effectively with a known data domain. Leveraging the stability-based histogram technique alongside the Encryption-Shuffling-Analysis (ESA) framework, our approach employs an auxiliary server to construct histograms without accessing original data messages. This protocol achieves accuracy akin to the central DP model while offering local-like privacy guarantees and substantially lowering computational costs.
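To make the count-mean-sketch idea concrete, here is a minimal Python sketch of the baseline CMS that the paper generalizes, written in the style of Apple's deployed protocol. It is not the paper's GCMS or OCMS. The function names, the SHA-256-based hash family, and the parameter names `eps`, `k` (number of hash functions), and `m` (hash range) are assumptions made for illustration only.

```python
import hashlib
import math
import random


def bucket(item: str, j: int, m: int) -> int:
    """Hypothetical hash family: the j-th hash maps an item to a bucket in [0, m)."""
    digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


def client_report(item: str, eps: float, k: int, m: int, rng: random.Random):
    """Client side: hash with a random function, encode as +/-1, flip bits."""
    j = rng.randrange(k)                      # index of the sampled hash function
    hot = bucket(item, j, m)
    flip_prob = 1.0 / (math.exp(eps / 2) + 1.0)
    report = []
    for i in range(m):
        bit = 1 if i == hot else -1           # +1 at the hashed bucket, -1 elsewhere
        if rng.random() < flip_prob:          # independent flip of every position
            bit = -bit
        report.append(bit)
    return report, j


def aggregate(reports, eps: float, k: int, m: int):
    """Server side: debias each report and add it to row j of a k-by-m sketch."""
    c = (math.exp(eps / 2) + 1.0) / (math.exp(eps / 2) - 1.0)
    sketch = [[0.0] * m for _ in range(k)]
    for report, j in reports:
        row = sketch[j]
        for i, bit in enumerate(report):
            row[i] += k * (c / 2.0 * bit + 0.5)
    return sketch


def estimate(sketch, item: str, n: int, k: int, m: int) -> float:
    """Average the item's bucket over all rows, then correct for hash collisions."""
    mean = sum(sketch[j][bucket(item, j, m)] for j in range(k)) / k
    return (m / (m - 1)) * (mean - n / m)
```

A caller would collect one `client_report` per user, pass the list to `aggregate`, and query `estimate` for each item of interest, with `n` set to the number of reports. Two inputs differ in at most two positions of the encoded vector, which is why the per-position flip probability uses `eps / 2`.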

Paper Structure

This paper contains 29 sections, 9 theorems, 48 equations, 9 figures, 6 algorithms.

Key Result

Theorem 1

For local randomizer with for any $\delta\in[0,1]$, shuffling can achieve $(\epsilon_c,\delta)$-DP with where $n$ denotes the number of data items.
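
For reference, the $(\epsilon_c,\delta)$-DP guarantee named in the theorem is an instance of the standard $(\epsilon,\delta)$-DP definition. In its usual form, with notation assumed here rather than taken from the paper ($\mathcal{M}$ for the randomized mechanism, $D$ and $D'$ for neighboring datasets, $S$ for a set of outputs), it reads:

$$
\Pr[\mathcal{M}(D)\in S] \;\le\; e^{\epsilon}\,\Pr[\mathcal{M}(D')\in S] + \delta
\quad\text{for all neighboring } D, D' \text{ and all measurable } S .
$$

Smaller $\epsilon$ and $\delta$ mean that the output distribution changes less when one individual's data changes.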

Figures (9)

  • Figure 1: Illustration of the optimal Count Mean Sketch with the ESA framework.
  • Figure 2: Choices of $p$ in OUE-LDP, Apple's CMS, and our OCMS. Optimal $p$ depends on the ratio of $n/f(d)$ and $k$, and is independent of $m$.
  • Figure 3: Illustration of the framework of privacy-preserving data collection with unknown domain.
  • Figure 4: Variance comparison between our approach and Apple's CMS under different epsilons. Lines with the same color indicate the same parameter setting. #.bits represents the bit length of the LDP reports, which is directly related to the hash function range $m$. $k$ is the number of hash functions, $n$ is the total number of LDP reports, and $f(d)$ is the true frequency of the queried item. The perturbation parameter $p$ for our approach is $0.5$.
  • Figure 5: Optimal selection of $p$ in various parameter settings. $n = 10000, k = 200, m = 200$.
  • ...and 4 more figures

Theorems & Definitions (17)

  • Definition 1: $(\epsilon,\delta)$-DP [DMNS06]
  • Theorem 1: Privacy amplification by shuffling [feldman2023stronger]
  • Remark 1
  • Theorem 2: Privacy of GCMS
  • Theorem 3: Utility of GCMS
  • Corollary 1: Theorem 2 in [wang2017locally]
  • Lemma 1
  • Remark 2
  • Remark 3
  • Theorem 4: Privacy of Algorithm \ref{alg:CDP_aux_server}
  • ...and 7 more