DPM: Clustering Sensitive Data through Separation

Johannes Liebenow, Yara Schütt, Tanya Braun, Marcel Gehrke, Florian Thaeter, Esfandiar Mohammadi

TL;DR

DPM is a privacy-preserving clustering algorithm that recursively separates a data set into clusters using a geometrical clustering approach. It achieves state-of-the-art utility on the standard clustering metrics and yields a clustering result much closer to that of the popular non-private KMeans algorithm, without requiring the number of classes as input.

Abstract

Clustering is an important tool for data exploration where the goal is to subdivide a data set into disjoint clusters that fit well into the underlying data structure. When dealing with sensitive data, privacy-preserving algorithms aim to approximate the non-private baseline while minimising the leakage of sensitive information. State-of-the-art privacy-preserving clustering algorithms tend to output clusters that are good in terms of the standard metrics (inertia, silhouette score, and clustering accuracy); however, the clustering result strongly deviates from the non-private KMeans baseline. In this work, we present a privacy-preserving clustering algorithm called DPM that recursively separates a data set into clusters based on a geometrical clustering approach. In addition, DPM estimates most of the data-dependent hyper-parameters in a privacy-preserving way. We prove that DPM preserves Differential Privacy and analyse the utility guarantees of DPM. Finally, we conduct an extensive empirical evaluation on synthetic and real-life data sets. We show that DPM achieves state-of-the-art utility on the standard clustering metrics and yields a clustering result much closer to that of the popular non-private KMeans algorithm without requiring the number of classes.

Paper Structure (59 sections, 18 theorems, 47 equations, 6 figures, 3 tables, 2 algorithms)

Key Result

Corollary 1

Given $\varepsilon \in \mathbb{R}_{>0}$ and a function $f$ with sensitivity $\Delta_f$, adding noise randomly drawn from $\text{Lap}(\Delta_f/\varepsilon)$ preserves $(\varepsilon, 0)$-DP.
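As a minimal sketch of what Corollary 1 describes, the snippet below adds Laplace noise with scale $\Delta_f/\varepsilon$ to the output of a query. The function names `laplace_noise` and `laplace_mechanism` and the counting-query example are illustrative assumptions, not taken from the paper. The sampler uses plain floating-point arithmetic and is not hardened for production use.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Lap(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(true_value: float, sensitivity: float,
                      epsilon: float, rng: random.Random) -> float:
    """Release true_value plus noise drawn from Lap(sensitivity / epsilon)."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    return true_value + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical example: a counting query has sensitivity 1, because adding
# or removing a single record changes the count by at most 1.
rng = random.Random(0)
noisy_count = laplace_mechanism(true_value=120.0, sensitivity=1.0,
                                epsilon=1.0, rng=rng)
```

A smaller $\varepsilon$ gives a larger noise scale, so the released value is more private but less accurate.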

Figures (6)

  • Figure 1: The distance ($\downarrow$) between the clustering result of privacy-preserving clustering algorithms and the non-private KMeans. Our proposed algorithm DPM outputs a clustering result that is close to that of the non-private baseline. Except for DPM, all algorithms received the reported number of classes as input for the number of cluster centres. All algorithms use a privacy budget of $\varepsilon = 1$ and $\delta = 1/(n \cdot \sqrt{n})$ and the values are taken from \ref{tab:kopt_main}.
  • Figure 2: A single recursion step in the process of DPM. ➀ The data points are projected onto each dimension and multiple split candidates are generated based on a fixed split interval size that is calibrated to the data set. ➁ A scoring function that depends on the specific clustering goal assigns a score to each split candidate. ➂ The split candidate with the highest score is selected with high probability to subdivide the data set into two disjoint subsets. This procedure is repeated recursively until only a few elements remain in each subset.
  • Figure 3: A visualisation of the scoring function that DPM uses to evaluate split candidates. In order to assign a score, the data points are projected to each dimension. Then, for every split candidate emptiness (light blue) and centreness (light green) are computed.
  • Figure 4: Evaluation of the running time (in seconds) of all algorithms including the non-private baseline averaged over $10$ runs. As shown in \ref{ssec:timeComplexityDPM}, the running time of DPM increases significantly with a large number of dimensions and slightly with an increasing number of data points. In general, the running time of DPM remains competitive with the other clustering algorithms.
  • Figure 5: Evaluation of the privacy budget distribution of DPM with respect to different clustering metrics for the MNIST Embs. data set. Each plot shows the upper (green) and lower (orange) $20\%$ of privacy budget distributions, averaged over $10$ runs with $\varepsilon_{\text{int}} + \varepsilon_{\text{exp}} + \varepsilon_{\text{cnt}} + \varepsilon_{\text{avg}} = 1$. Each metric prefers a slightly different privacy budget distribution; for the experiments we therefore choose a distribution that performs well for all metrics and that is not specific to any data set.
  • ...and 1 more figure
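
The selection step in Figure 2, where the highest-scoring split candidate is chosen with high probability, matches the behaviour of the Exponential Mechanism listed among the definitions. The sketch below shows the standard form of that mechanism, which samples candidate $i$ with probability proportional to $\exp(\varepsilon \cdot s_i / (2\Delta))$. The function name, the example scores, and the sensitivity value are hypothetical; the paper's actual scoring function (emptiness and centreness) and its sensitivity are not reproduced here.

```python
import math
import random


def exponential_mechanism(scores, sensitivity, epsilon, rng):
    """Return index i with probability proportional to
    exp(epsilon * scores[i] / (2 * sensitivity))."""
    if not scores:
        raise ValueError("at least one candidate is required")
    top = max(scores)
    # Subtracting the maximum score leaves the sampling probabilities
    # unchanged but prevents overflow in exp().
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity))
               for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


# Hypothetical scores for four split candidates (assumed sensitivity 1).
candidate_scores = [3.0, 41.0, 12.0, 40.0]
rng = random.Random(0)
chosen = exponential_mechanism(candidate_scores, sensitivity=1.0,
                               epsilon=1.0, rng=rng)
```

With these assumed scores, the two best candidates (indices 1 and 3) receive almost all of the probability mass, and a larger $\varepsilon$ concentrates the choice further on the top-scoring candidate.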

Theorems & Definitions (50)

  • Definition 1: Neighbouring data sets
  • Definition 2: Sensitivity
  • Definition 3: $(\varepsilon, \delta)$-DP
  • Corollary 1: DP of Laplace Mechanism
  • Definition 4: Exponential Mechanism
  • Definition 5: Emptiness
  • Definition 6: Centreness
  • Definition 7: Scoring function
  • Definition 8: Shifted Noised Count
  • Corollary 2: DP of shifted noisy count
  • ...and 40 more