DPM: Clustering Sensitive Data through Separation

Johannes Liebenow; Yara Schütt; Tanya Braun; Marcel Gehrke; Florian Thaeter; Esfandiar Mohammadi

TL;DR

DPM is a privacy-preserving clustering algorithm that recursively separates a data set into clusters using a geometrical clustering approach. It achieves state-of-the-art utility on the standard clustering metrics and yields a clustering result much closer to that of the popular non-private KMeans algorithm, without requiring the number of classes as input.

Abstract

Clustering is an important tool for data exploration where the goal is to subdivide a data set into disjoint clusters that fit well into the underlying data structure. When dealing with sensitive data, privacy-preserving algorithms aim to approximate the non-private baseline while minimising the leakage of sensitive information. State-of-the-art privacy-preserving clustering algorithms tend to output clusters that are good in terms of the standard metrics (inertia, silhouette score, and clustering accuracy); however, their clustering results strongly deviate from the non-private KMeans baseline. In this work, we present a privacy-preserving clustering algorithm called DPM that recursively separates a data set into clusters based on a geometrical clustering approach. In addition, DPM estimates most of the data-dependent hyper-parameters in a privacy-preserving way. We prove that DPM preserves Differential Privacy and analyse the utility guarantees of DPM. Finally, we conduct an extensive empirical evaluation on synthetic and real-life data sets. We show that DPM achieves state-of-the-art utility on the standard clustering metrics and yields a clustering result much closer to that of the popular non-private KMeans algorithm, without requiring the number of classes as input.

Paper Structure (59 sections, 18 theorems, 47 equations, 6 figures, 3 tables, 2 algorithms)

Introduction
Contribution
Structure
Preliminaries
Notation
Clustering
Differential Privacy
DPM
Methodology
Splits Through Sparse Regions
Emptiness
Centreness
Enabling Privacy-Preserving Splits
Noisy Count
Distributing the Privacy Budget
...and 44 more sections

Key Result

Corollary 1

Given $\varepsilon \in \mathbb{R}_{>0}$ and a function $f$ with sensitivity $\Delta_f$, adding noise randomly drawn from $\text{Lap}(\Delta_f/\varepsilon)$ preserves $(\varepsilon, 0)$-DP.
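To make Corollary 1 concrete, the following is a minimal sketch of the Laplace mechanism applied to a counting query. The function names (`laplace_noise`, `laplace_mechanism`) and the example data are hypothetical and not taken from the paper; the sketch also assumes that the count has sensitivity $\Delta_f = 1$, which holds when neighbouring data sets differ by adding or removing one element. It uses only the Python standard library and is not hardened against floating-point attacks, so it should be read as an illustration rather than a production implementation.

```python
import random


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two i.i.d. exponentials with mean `scale`."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Return value + Lap(sensitivity / epsilon), as in Corollary 1."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    if rng is None:
        rng = random.Random()
    return value + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical data; the count is assumed to have sensitivity 1.
    points = [0.2, 0.4, 0.9, 1.3, 2.1]
    noisy_count = laplace_mechanism(len(points), sensitivity=1.0, epsilon=1.0)
    print(f"true count = {len(points)}, noisy count = {noisy_count:.2f}")
```

A smaller $\varepsilon$ increases the noise scale $\Delta_f/\varepsilon$, so the released value is more private but less accurate.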

Figures (6)

Figure 1: The distance ($\downarrow$) between the clustering result of privacy-preserving clustering algorithms and the non-private KMeans. Our proposed algorithm DPM outputs a clustering result that is close to that of the non-private baseline. Except for DPM, all algorithms received the reported number of classes as input for the number of cluster centres. All algorithms use a privacy budget of $\varepsilon = 1$ and $\delta = 1/(n \cdot \sqrt{n})$ and the values are taken from \ref{tab:kopt_main}.
Figure 2: A single recursion step in the process of DPM. ➀ The data points are projected onto each dimension and multiple split candidates are generated based on a fixed split interval size that is calibrated to the data set. ➁ A scoring function that depends on the specific clustering goal assigns a score to each split candidate. ➂ The split candidate with the highest score is selected with high probability to subdivide the data set into two disjoint subsets. This procedure is repeated recursively until only a few elements remain in each subset.
Figure 3: A visualisation of the scoring function that DPM uses to evaluate split candidates. In order to assign a score, the data points are projected to each dimension. Then, for every split candidate emptiness (light blue) and centreness (light green) are computed.
Figure 4: Evaluation of the running time (in seconds) of all algorithms including the non-private baseline averaged over $10$ runs. As shown in \ref{ssec:timeComplexityDPM}, the running time of DPM increases significantly with a large number of dimensions and slightly with an increasing number of data points. In general, the running time of DPM remains competitive with the other clustering algorithms.
Figure 5: Evaluation of the privacy budget distribution of DPM with respect to different clustering metrics for the MNIST Embs. data set. Each plot shows the upper (green) and lower (orange) $20\%$ of privacy budget distributions, averaged over $10$ runs with $\varepsilon_{\text{int}} + \varepsilon_{\text{exp}} + \varepsilon_{\text{cnt}} + \varepsilon_{\text{avg}} = 1$. Each metric prefers a slightly different privacy budget distribution, thus for the experiments we choose a distribution that performs well for all metrics and that is not specific for any data set.
...and 1 more figure
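Step ➂ of Figure 2 selects the highest-scoring split candidate only with high probability, which is the behaviour of the Exponential Mechanism (Definition 4). The sketch below shows the textbook form of that mechanism, where candidate $i$ is chosen with probability proportional to $\exp(\varepsilon \cdot s_i / (2\Delta))$. The function name `exponential_mechanism`, the scores, and the sensitivity value are illustrative assumptions; the scores are placeholders and are not computed from DPM's emptiness and centreness, and the exact form used in the paper may differ.

```python
import math
import random


def exponential_mechanism(scores, sensitivity, epsilon, rng=None):
    """Return an index i with probability proportional to exp(eps * scores[i] / (2 * sensitivity))."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    if rng is None:
        rng = random.Random()
    # Subtracting the maximum leaves the distribution unchanged and avoids overflow.
    best = max(scores)
    weights = [math.exp(epsilon * (s - best) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


if __name__ == "__main__":
    # Placeholder scores for four hypothetical split candidates.
    candidate_scores = [0.1, 0.7, 0.3, 0.65]
    chosen = exponential_mechanism(candidate_scores, sensitivity=1.0, epsilon=1.0)
    print(f"selected split candidate index: {chosen}")
```

With a large $\varepsilon$ the best candidate is chosen almost always; as $\varepsilon$ shrinks, the selection approaches a uniform draw over all candidates.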

Theorems & Definitions (50)

Definition 1: Neighbouring data sets
Definition 2: Sensitivity
Definition 3: $(\varepsilon, \delta)$-DP
Corollary 1: DP of Laplace Mechanism
Definition 4: Exponential Mechanism
Definition 5: Emptiness
Definition 6: Centreness
Definition 7: Scoring function
Definition 8: Shifted Noised Count
Corollary 2: DP of shifted noisy count
...and 40 more
