On the Price of Differential Privacy for Hierarchical Clustering

Chengyuan Deng, Jie Gao, Jalaj Upadhyay, Chen Wang, Samson Zhou

TL;DR

This work studies hierarchical clustering (HC) under differential privacy, focusing on weight-level DP in which every edge has weight at least one, an assumption that sidesteps the strong lower bounds known for edge-level DP. The authors design a polynomial-time algorithm that achieves a multiplicative $O\left(\frac{\log^{1.5} n}{\varepsilon}\right)$-approximation to Dasgupta's HC cost by perturbing weights and leveraging a private balanced sparsest cut, integrated into a recursive HC construction via post-processing. They prove that once the unit-weight assumption is dropped, weight-level DP incurs $\Omega\left(\frac{n^{2}}{\varepsilon}\right)$ additive error, essentially the same lower bound as in the edge-level setting, and they derive a new $\tilde{\Omega}\left(\frac{1}{\varepsilon}\right)$ additive-error lower bound for weight-level DP balanced sparsest cuts. Empirical results on synthetic and real datasets show substantial improvements over input perturbation baselines and favorable scalability. Overall, the paper shows that under the weight-DP model with a unit lower bound on edge weights, private HC admits a polylogarithmic multiplicative guarantee, and its lower bounds make precise what is lost when that assumption is removed.

Abstract

Hierarchical clustering is a fundamental unsupervised machine learning task with the aim of organizing data into a hierarchy of clusters. Many applications of hierarchical clustering involve sensitive user information, therefore motivating recent studies on differentially private hierarchical clustering under the rigorous framework of Dasgupta's objective. However, it has been shown that any privacy-preserving algorithm under edge-level differential privacy necessarily suffers a large error. To capture practical applications of this problem, we focus on the weight privacy model, where each edge of the input graph is at least unit weight. We present a novel algorithm in the weight privacy model that shows significantly better approximation than known impossibility results in the edge-level DP setting. In particular, our algorithm achieves $O(\log^{1.5}n/\varepsilon)$ multiplicative error for $\varepsilon$-DP and runs in polynomial time, where $n$ is the size of the input graph, and the cost is never worse than the optimal additive error in existing work. We complement our algorithm by showing that if the unit-weight constraint does not apply, the lower bound for weight-level DP hierarchical clustering is essentially the same as for edge-level DP, i.e., $\Omega(n^2/\varepsilon)$ additive error. As a result, we also obtain a new lower bound of $\tilde{\Omega}(1/\varepsilon)$ additive error for balanced sparsest cuts in the weight-level DP model, which may be of independent interest. Finally, we evaluate our algorithm on synthetic and real-world datasets. Our experimental results show that our algorithm performs well in terms of extra cost and has good scalability to large graphs.
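
As a rough illustration of what privacy at the level of edge weights means, the sketch below applies the generic Laplace mechanism to a weight vector. It is not the authors' algorithm. It assumes, as is common in weight-level DP, that the graph topology is public and that two weight vectors are neighboring when they differ by at most 1 in $\ell_1$ norm; the paper's own definition of neighboring weights is not reproduced here. The names `laplace_noise` and `perturb_weights` are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def perturb_weights(weights, epsilon, rng=None):
    """Add independent Laplace(1/epsilon) noise to every edge weight.

    Assumption: neighboring weight vectors differ by at most 1 in l1 norm,
    so the identity query has l1 sensitivity 1 and the output is epsilon-DP.
    """
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    return {edge: w + laplace_noise(scale, rng) for edge, w in weights.items()}

# Example: a path on three vertices with unit weights.
weights = {frozenset((0, 1)): 1.0, frozenset((1, 2)): 1.0}
noisy = perturb_weights(weights, epsilon=1.0, rng=random.Random(0))
```

Anything computed from `noisy` alone inherits the same privacy guarantee by post-processing. Noisy weights can be negative, so a downstream clustering routine has to tolerate or clip them.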

Paper Structure

This paper contains 34 sections, 29 theorems, 60 equations, 9 figures, 2 tables.

Key Result

Proposition 2.1

Let $\mathcal{T}$ be an HC tree that is obtained by recursively performing $\alpha$-approximate $1/3$-balanced sparsest cut on the vertex-induced subgraphs. Then, $\mathcal{T}$ gives an $O(\alpha)$-approximation to the optimal Dasgupta's HC cost.
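
The recursive construction in the proposition can be sketched on a toy graph as follows. This is a minimal, non-private illustration under stated assumptions: the sparsity of a cut $(S, V \setminus S)$ is taken to be $w(S, V \setminus S) / (|S| \cdot |V \setminus S|)$, "$1/3$-balanced" is read as both sides containing at least $n/3$ vertices, and the cut is found exactly by brute force (so $\alpha = 1$, with exponential running time). Dasgupta's cost is computed as $\sum_{(u,v)} w(u,v) \cdot |\mathrm{leaves}(\mathcal{T}[u \vee v])|$, where $u \vee v$ is the lowest common ancestor of $u$ and $v$. All function names are hypothetical.

```python
from itertools import combinations

def cut_weight(weights, side, rest):
    # Total weight of edges with one endpoint in `side` and the other in `rest`.
    return sum(weights.get(frozenset((u, v)), 0.0) for u in side for v in rest)

def balanced_sparsest_cut(vertices, weights):
    """Exact 1/3-balanced sparsest cut by brute force (tiny inputs only)."""
    n = len(vertices)
    min_side = -(-n // 3)  # ceil(n / 3)
    best = None
    for k in range(min_side, n // 2 + 1):
        for side in combinations(vertices, k):
            rest = [v for v in vertices if v not in side]
            sparsity = cut_weight(weights, side, rest) / (len(side) * len(rest))
            if best is None or sparsity < best[0]:
                best = (sparsity, list(side), rest)
    return best[1], best[2]

def build_tree(vertices, weights):
    # Recursively split the vertex-induced subgraph; leaves are single vertices.
    if len(vertices) == 1:
        return vertices[0]
    left, right = balanced_sparsest_cut(vertices, weights)
    return (build_tree(left, weights), build_tree(right, weights))

def leaves(tree):
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def dasgupta_cost(tree, weights):
    # Each edge is charged the number of leaves under the node that separates it.
    if not isinstance(tree, tuple):
        return 0.0
    left, right = leaves(tree[0]), leaves(tree[1])
    here = cut_weight(weights, left, right) * (len(left) + len(right))
    return here + dasgupta_cost(tree[0], weights) + dasgupta_cost(tree[1], weights)

# Two unit-weight triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
weights = {frozenset(e): 1.0 for e in edges}
tree = build_tree(list(range(6)), weights)
cost = dasgupta_cost(tree, weights)
```

On this example the first cut separates the two triangles across the bridge, which is the split one would expect from the clustering structure. Replacing the brute-force routine with an $\alpha$-approximate balanced sparsest cut gives the setting of the proposition.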

Figures (9)

  • Figure 1: Comparison of Dasgupta's cost on SBM graphs of size $n=150$ and $k=5$.
  • Figure 1: Comparison of Runtime (s) with $n=6,8,10$
  • Figure 2: Comparison of Dasgupta's cost on HSBM graphs of size $n=150$ and $k=5$.
  • Figure 3: Comparison of Dasgupta's cost on real-world datasets: IRIS, WINE and BOSTON.
  • Figure 4: Runtime with $n$ scaling up to 1500
  • ...and 4 more figures

Theorems & Definitions (66)

  • Definition 1: Sparsest Cut
  • Definition 2: Hierarchical clustering trees
  • Definition 3: Hierarchical Clustering under Dasgupta's Objective [dasgupta2016cost]
  • Proposition 2.1: [charikar2017approximate, dasgupta2016cost]
  • Definition 4: Neighboring weights
  • Definition 5: Differential Privacy
  • Definition 6: Neighboring weights
  • Theorem 1: Formalization of Result \ref{rst:hc-new-upper}
  • Proposition 3.1: [charikar2017approximate, krauthgamer2009partitioning], rephrased
  • Lemma 3.1
  • ...and 56 more