Differentially Private Clustering in Data Streams
Alessandro Epasto, Tamalika Mukherjee, Peilin Zhong
TL;DR
This work addresses differentially private (DP) clustering of streaming data, specifically $k$-means and $k$-median, under the continual release model with insertions only. It introduces a general framework that reduces streaming DP clustering to offline DP coreset or clustering algorithms, which are plugged in as blackboxes. To limit the growth of additive error, the framework partitions the data space into groups using a bicriteria center set and applies a DP merge-and-reduce procedure within rings around these centers. The authors obtain two main results: (i) an $O(1)$-multiplicative approximation with sublinear space $\tilde{O}(k^{1.5} \cdot \text{poly}(d, \log T))$ and $\text{poly}(k, d, \log T)$ additive error, and (ii) a $(1+\gamma)$-multiplicative approximation with $\tilde{O}_{\gamma}(\text{poly}(k, 2^{O_{\gamma}(d)}, \log T))$ space and $\text{poly}(k, 2^{O_{\gamma}(d)}, \log T)$ additive error, for any $\gamma > 0$. Because the offline algorithms are used only as blackboxes, their DP guarantees carry over to the streaming setting. These are the first sublinear-space DP clustering algorithms for streams, and they offer a path toward practical, privacy-preserving, continual-release clustering for large-scale data. The results have implications for privacy-aware streaming analytics and for making DP clustering components more widely accessible.
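The merge-and-reduce pattern named above can be illustrated with a short sketch. This is not the authors' algorithm and it is not differentially private: `uniform_reduce` is a hypothetical placeholder for the offline DP coreset blackbox, and the names `MergeAndReduce`, `block_size`, and `reduce_fn` are assumptions introduced here. The sketch also omits the bicriteria centers and rings; it only shows how a stream can be summarized in space logarithmic in its length by repeatedly merging and re-summarizing equal-level blocks.

```python
import random
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, ...]
Weighted = List[Tuple[Point, float]]  # (point, weight) pairs

_RNG = random.Random(0)  # fixed seed so the sketch is reproducible


def uniform_reduce(points: Weighted, size: int) -> Weighted:
    """Hypothetical stand-in for a coreset blackbox (NOT private).

    Keeps a uniform sample and rescales weights to preserve total weight.
    """
    if len(points) <= size:
        return list(points)
    total = sum(w for _, w in points)
    sample = _RNG.sample(points, size)
    scale = total / sum(w for _, w in sample)
    return [(p, w * scale) for p, w in sample]


class MergeAndReduce:
    """Binary-counter merge-and-reduce over an insertion-only stream."""

    def __init__(self, block_size: int,
                 reduce_fn: Callable[[Weighted, int], Weighted]) -> None:
        self.block_size = block_size
        self.reduce_fn = reduce_fn
        self.buffer: Weighted = []
        # levels[i] is None or a summary covering 2**i input blocks
        self.levels: List[Optional[Weighted]] = []

    def insert(self, point: Point) -> None:
        self.buffer.append((point, 1.0))
        if len(self.buffer) < self.block_size:
            return
        # A full block becomes a level-0 summary, then carries upward
        # like incrementing a binary counter.
        carry = self.reduce_fn(self.buffer, self.block_size)
        self.buffer = []
        level = 0
        while level < len(self.levels) and self.levels[level] is not None:
            carry = self.reduce_fn(self.levels[level] + carry, self.block_size)
            self.levels[level] = None
            level += 1
        if level == len(self.levels):
            self.levels.append(None)
        self.levels[level] = carry

    def summary(self) -> Weighted:
        """Union of the partial buffer and all stored level summaries."""
        out = list(self.buffer)
        for s in self.levels:
            if s is not None:
                out.extend(s)
        return out
```

In this sketch the summary can be requested after every insertion, which mirrors the continual release requirement, and an offline clustering routine could then be run on the weighted points it returns. A private version would need a DP `reduce_fn` and an accounting of how noise accumulates across levels, which is the additive-error growth that the ring partitioning is meant to control.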
Abstract
Clustering problems (such as $k$-means and $k$-median) are fundamental unsupervised machine learning primitives, and streaming clustering algorithms have been extensively studied in the past. However, as data privacy has become a central concern in many real-world applications, non-private clustering algorithms may not be applicable in many scenarios. In this work, we provide the first differentially private algorithms for $k$-means and $k$-median clustering of $d$-dimensional Euclidean data points over a stream of length at most $T$ that use space sublinear in $T$ in the continual release setting, where the algorithm is required to output a clustering at every timestep. We achieve (1) an $O(1)$-multiplicative approximation with $\tilde{O}(k^{1.5} \cdot \text{poly}(d, \log T))$ space and $\text{poly}(k, d, \log T)$ additive error, or (2) a $(1+\gamma)$-multiplicative approximation with $\tilde{O}_{\gamma}(\text{poly}(k, 2^{O_{\gamma}(d)}, \log T))$ space and $\text{poly}(k, 2^{O_{\gamma}(d)}, \log T)$ additive error, for any $\gamma > 0$. Our main technical contribution is a differentially private clustering framework for data streams which only requires an offline DP coreset or clustering algorithm as a blackbox.
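For reference, guarantees with a multiplicative approximation and an additive error are usually stated in the following form. The symbols $\alpha$, $\beta$, $X_t$, $C_t$, $z$, and $\mathrm{OPT}_k$ are notation assumed here for illustration and do not appear in the text above.

$$
\mathrm{cost}(C_t, X_t) \le \alpha \cdot \mathrm{OPT}_k(X_t) + \beta,
\qquad
\mathrm{cost}(C, X) = \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert_2^{z},
$$

where $X_t$ is the set of points received up to timestep $t$, $C_t$ is the set of $k$ centers output at that timestep, $\mathrm{OPT}_k(X_t)$ is the minimum cost over all sets of $k$ centers, and $z = 2$ for $k$-means or $z = 1$ for $k$-median. Under this reading, result (1) corresponds to $\alpha = O(1)$ and $\beta = \text{poly}(k, d, \log T)$, and result (2) to $\alpha = 1+\gamma$ and $\beta = \text{poly}(k, 2^{O_{\gamma}(d)}, \log T)$.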
