When Focus Enhances Utility: Target Range LDP Frequency Estimation and Unknown Item Discovery
Bo Jiang, Wanrong Zhang, Donghang Lu, Jian Du, Qiang Yan
TL;DR
This paper addresses the challenge of accurate frequency estimation under Local Differential Privacy (LDP) and extends it to data domains that are unknown a priori. It introduces Generalized Count Mean Sketch (GCMS) and Optimal CMS (OCMS) within an Encryption-Shuffling-Analysis (ESA) framework, achieving improved communication efficiency and higher utility under LDP while enabling targeted frequency estimation. For unknown domains, it proposes a stability-based histogram protocol augmented with an auxiliary server that does not access raw messages, delivering accuracy close to central DP with local-like privacy and lower computation. Theoretical analyses and extensive experiments on real datasets demonstrate GCMS's superior utility compared to CMS, OCMS's targeted-frequency variance minimization, and practical privacy-utility gains for unknown-domain data collection.
Abstract
Local Differential Privacy (LDP) protocols enable the collection of randomized client messages for data analysis without requiring a trusted data curator. Such protocols have been successfully deployed in real-world scenarios by major tech companies like Google, Apple, and Microsoft. In this paper, we propose a Generalized Count Mean Sketch (GCMS) protocol that captures many existing frequency estimation protocols. Our method significantly improves the three-way trade-off between communication, privacy, and accuracy. We also introduce a general utility analysis framework that enables optimizing parameter designs. Based on that, we propose an Optimal Count Mean Sketch (OCMS) framework that minimizes the variance for collecting items with targeted frequencies. Moreover, because our frequency estimation protocols work effectively only when the data domain is known, we present a novel protocol for collecting data from an unknown domain. Leveraging the stability-based histogram technique alongside the Encryption-Shuffling-Analysis (ESA) framework, our approach employs an auxiliary server to construct histograms without accessing original data messages. This protocol achieves accuracy akin to the central DP model while offering local-like privacy guarantees and substantially lowering computational costs.
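To make the stability-based histogram idea concrete, the following is a minimal sketch of the textbook central-model version of the technique, not the protocol proposed in this paper: it omits the ESA pipeline and the auxiliary server entirely. The function name `stability_histogram`, the Laplace scale of 2/ε, and the release threshold of 1 + 2 ln(2/δ)/ε are assumptions taken from the commonly cited form of the technique, not from this abstract.

```python
import math
import random
from collections import Counter


def stability_histogram(items, epsilon, delta, rng=None):
    """Textbook stability-based histogram (illustrative assumption, not the paper's protocol).

    Only items that actually occur are considered, so the data domain never
    has to be enumerated in advance. Each observed count receives Laplace
    noise and is released only if the noisy value clears a threshold, which
    hides items contributed by very few users.
    """
    rng = rng or random.Random()
    scale = 2.0 / epsilon  # assumed Laplace scale
    threshold = 1.0 + 2.0 * math.log(2.0 / delta) / epsilon  # assumed release threshold

    released = {}
    for item, count in Counter(items).items():
        # The difference of two i.i.d. exponentials with mean `scale`
        # is a Laplace sample with that scale.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy_count = count + noise
        if noisy_count > threshold:
            released[item] = noisy_count
    return released


# Hypothetical data: one frequent item and one item reported by a single user.
reports = ["popular.example"] * 1000 + ["rare.example"]
histogram = stability_histogram(reports, epsilon=1.0, delta=1e-6, rng=random.Random(0))
```

With these parameters the threshold is roughly 30, so the frequent item is released with a count near 1000 while the singleton is suppressed with overwhelming probability. That suppression is what allows a histogram to be built over a domain that is not known in advance.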
