Differentially Private Clustering in Data Streams

Alessandro Epasto; Tamalika Mukherjee; Peilin Zhong

TL;DR

This work addresses differentially private (DP) clustering of streaming data, specifically $k$-means and $k$-median, in the insertion-only continual release model. It introduces a general framework that reduces streaming DP clustering to offline DP coreset or clustering algorithms, which are plugged in as blackboxes. To keep the additive error from growing, the framework partitions the data space into groups using a bicriteria center set and applies a DP merge-and-reduce procedure within rings around these centers. The authors obtain two main results: (i) an $O(1)$-multiplicative approximation with sublinear space $\tilde{O}(k^{1.5} \cdot \text{poly}(d, \log T))$ and $\text{poly}(k,d,\log T)$ additive error, and (ii) for any $\gamma>0$, a $(1+\gamma)$-multiplicative approximation with $\tilde{O}_{\gamma}(\text{poly}(k,2^{O_{\gamma}(d)}, \log T))$ space and $\text{poly}(k,2^{O_{\gamma}(d)}, \log T)$ additive error. Because existing offline DP clustering or coreset algorithms are used as blackboxes, their DP guarantees transfer directly to the streaming setting. These are the first DP clustering algorithms for streams that use space sublinear in the stream length $T$, and they offer a path toward practical, privacy-preserving, continual-release clustering of large-scale data. The results are relevant to privacy-aware streaming analytics and to making DP clustering components more widely accessible.
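The grouping and merge-and-reduce steps described above can be sketched as follows. This is a minimal, non-private illustration under stated assumptions, not the paper's algorithm: the center set is assumed fixed (the paper maintains a bicriteria center set as the stream evolves), ring radii are assumed to double starting from 1, and `centroid_reducer` is a hypothetical stand-in for the offline DP coreset blackbox. The names `ring_index`, `MergeReduce`, and `summarize_by_rings` are illustrative only.

```python
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, ...]
Weighted = Tuple[Point, float]  # (point, weight)
Reducer = Callable[[List[Weighted]], List[Weighted]]


def ring_index(x: Point, centers: List[Point]) -> int:
    """Ring 0 holds points within distance 1 of the nearest center; ring r >= 1
    holds points whose distance lies in (2**(r-1), 2**r]. (Assumed radii.)"""
    d = min(math.dist(x, c) for c in centers)
    return 0 if d <= 1.0 else math.ceil(math.log2(d))


def centroid_reducer(pts: List[Weighted]) -> List[Weighted]:
    """Hypothetical, NON-private stand-in for the blackbox: collapse a block
    into its weighted centroid, preserving total weight."""
    total = sum(w for _, w in pts)
    dim = len(pts[0][0])
    mean = tuple(sum(p[i] * w for p, w in pts) / total for i in range(dim))
    return [(mean, total)]


class MergeReduce:
    """Insertion-only merge-and-reduce: at most one summary per level."""

    def __init__(self, block_size: int, reducer: Reducer) -> None:
        self.block_size = block_size
        self.reducer = reducer
        self.buffer: List[Weighted] = []
        self.levels: Dict[int, List[Weighted]] = {}

    def insert(self, x: Point) -> None:
        self.buffer.append((x, 1.0))
        if len(self.buffer) < self.block_size:
            return
        summary, level = self.reducer(self.buffer), 0
        self.buffer = []
        # Carry as in binary addition: two level-i summaries merge into level i+1.
        while level in self.levels:
            summary = self.reducer(self.levels.pop(level) + summary)
            level += 1
        self.levels[level] = summary

    def summary(self) -> List[Weighted]:
        out = list(self.buffer)
        for s in self.levels.values():
            out.extend(s)
        return out


def summarize_by_rings(stream: List[Point], centers: List[Point],
                       block_size: int, reducer: Reducer) -> Dict[int, List[Weighted]]:
    """Route each arriving point to the merge-and-reduce instance of its ring."""
    rings: Dict[int, MergeReduce] = {}
    for x in stream:
        r = ring_index(x, centers)
        if r not in rings:
            rings[r] = MergeReduce(block_size, reducer)
        rings[r].insert(x)
    return {r: mr.summary() for r, mr in rings.items()}
```

With $n$ inserted points and block size $b$, the structure keeps roughly $\log_2(n/b)$ summaries plus one partial buffer, which is where the polylogarithmic dependence on the stream length comes from in merge-and-reduce schemes. Swapping `centroid_reducer` for a DP coreset routine is the point at which privacy (and additive error) would enter.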

Abstract

Clustering problems (such as $k$-means and $k$-median) are fundamental unsupervised machine learning primitives, and streaming clustering algorithms have been extensively studied in the past. However, as data privacy has become a central concern in many real-world applications, non-private clustering algorithms may not be applicable in many scenarios. In this work, we provide the first differentially private algorithms for $k$-means and $k$-median clustering of $d$-dimensional Euclidean data points over a stream with length at most $T$ using space that is sublinear (in $T$) in the continual release setting, where the algorithm is required to output a clustering at every timestep. We achieve (1) an $O(1)$-multiplicative approximation with $\tilde{O}(k^{1.5} \cdot \text{poly}(d,\log(T)))$ space and $\text{poly}(k,d,\log(T))$ additive error, or (2) a $(1+\gamma)$-multiplicative approximation with $\tilde{O}_{\gamma}(\text{poly}(k,2^{O_{\gamma}(d)},\log(T)))$ space for any $\gamma>0$, with additive error $\text{poly}(k,2^{O_{\gamma}(d)},\log(T))$. Our main technical contribution is a differentially private clustering framework for data streams which only requires an offline DP coreset or clustering algorithm as a blackbox.
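To make the two error terms concrete, the sketch below computes the clustering cost and checks a guarantee of the usual form $\text{cost} \le \alpha \cdot \text{OPT} + \eta$, where $\alpha$ is the multiplicative factor and $\eta$ the additive error. The function names and the exponent parameter `z` are assumptions introduced for illustration; the optimal cost `opt` is taken as given rather than computed.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def clustering_cost(points: Sequence[Point], centers: Sequence[Point], z: int) -> float:
    """Sum over points of (distance to nearest center) ** z.
    z = 1 gives the k-median cost, z = 2 the k-means cost."""
    return sum(min(math.dist(x, c) for c in centers) ** z for x in points)


def within_guarantee(cost: float, opt: float, alpha: float, eta: float) -> bool:
    """True if cost <= alpha * opt + eta (multiplicative alpha, additive eta)."""
    return cost <= alpha * opt + eta


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
kmedian_cost = clustering_cost(points, centers, z=1)  # distances 0, 2, 0
kmeans_cost = clustering_cost(points, centers, z=2)   # squared distances 0, 4, 0
```

In result (2), $\alpha = 1+\gamma$ can be pushed arbitrarily close to 1, at the price of space and additive error that grow as $2^{O_{\gamma}(d)}$.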

Paper Structure (34 sections, 41 theorems, 40 equations, 5 algorithms)

Introduction
Differential Privacy Model and Clustering Problem
Our Results
Related Work
Concurrent Works.
Our Techniques
Naive Merge and Reduce approaches fail.
Our Approach.
Bicriteria Approximation.
Grouping points and applying Merge and Reduce.
Charging Additive Error to Multiplicative Error.
Comparison of our techniques to EZNMC22.
Differentially Private Clustering Framework
Main Algorithm (\ref{alg:extend-cluster}).
Analysis.
...and 19 more sections

Key Result

Theorem 1

Given dimension $d$, clustering parameter $k$, arbitrary parameter $C_M$, a non-DP $(1+\gamma)$-coreset algorithm, an $(\varepsilon,\delta)$-DP $(\kappa,\eta_1,\eta_2)$-semicoreset algorithm $\mathcal{A}$ that outputs a semicoreset of size $SZ_{\mathcal{A}}(\cdot)$ and using space $S$ … where $M=O\left(\frac{d^3 \eta_2}{C_M}\right)$.

Theorems & Definitions (79)

Definition 1: Differential privacy DMNS06
Definition 2: $(\kappa,\eta_1,\eta_2)$-semicoreset
Theorem 1: Main
Remark
Remark
Theorem 2
Theorem 3
Remark
Definition 3: Ring centered at a Set
Theorem 4
...and 69 more
