On the Price of Differential Privacy for Hierarchical Clustering
Chengyuan Deng, Jie Gao, Jalaj Upadhyay, Chen Wang, Samson Zhou
TL;DR
This work studies hierarchical clustering (HC) under differential privacy (DP), focusing on the weight-level DP model in which every edge has weight at least one, a restriction that sidesteps the strong lower bounds known for edge-level DP. The authors design a polynomial-time $\varepsilon$-DP algorithm that achieves an $O\left(\frac{\log^{1.5} n}{\varepsilon}\right)$ multiplicative approximation to Dasgupta's HC cost by perturbing weights and leveraging a private balanced sparsest cut, integrated into a recursive HC construction via post-processing. They also prove that, when the unit-weight assumption is dropped, weight-level DP incurs an $\Omega\left(\frac{n^{2}}{\varepsilon}\right)$ additive-error lower bound, essentially the same as in the edge-level DP setting, and they derive a new $\tilde{\Omega}\left(\frac{1}{\varepsilon}\right)$ additive-error lower bound for balanced sparsest cuts under weight-level DP. Empirical results on synthetic and real datasets show substantial improvements over input perturbation baselines and good scalability to large graphs. Overall, the paper shows that, under a natural weight-DP model, private HC admits strong theoretical guarantees and good practical performance, while concrete lower bounds clarify the limits of weight-level privacy.
Abstract
Hierarchical clustering is a fundamental unsupervised machine learning task that aims to organize data into a hierarchy of clusters. Many applications of hierarchical clustering involve sensitive user information, which has motivated recent studies on differentially private hierarchical clustering under the rigorous framework of Dasgupta's objective. However, it has been shown that any privacy-preserving algorithm under edge-level differential privacy necessarily suffers a large error. To capture practical applications of this problem, we focus on the weight privacy model, where each edge of the input graph has weight at least one. We present a novel algorithm in the weight privacy model that achieves a significantly better approximation than the known impossibility results permit in the edge-level DP setting. In particular, our algorithm achieves $O(\log^{1.5} n/\varepsilon)$ multiplicative error for $\varepsilon$-DP and runs in polynomial time, where $n$ is the size of the input graph, and the cost is never worse than the optimal additive error in existing work. We complement our algorithm by showing that, if the unit-weight constraint does not apply, the lower bound for weight-level DP hierarchical clustering is essentially the same as for edge-level DP, i.e., $\Omega(n^2/\varepsilon)$ additive error. As a result, we also obtain a new lower bound of $\tilde{\Omega}(1/\varepsilon)$ additive error for balanced sparsest cuts in the weight-level DP model, which may be of independent interest. Finally, we evaluate our algorithm on synthetic and real-world datasets. Our experimental results show that our algorithm performs well in terms of extra cost and scales well to large graphs.
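
To make the objective concrete, the following minimal sketch computes Dasgupta's cost of a hierarchy on a small weighted graph: each edge contributes its weight times the number of leaves under the lowest common ancestor of its endpoints. It also shows a simple Laplace-based perturbation of the edge weights. This is only an illustration and not the algorithm of the paper. The helper names (`dasgupta_cost`, `perturb_weights`), the nested-tuple tree encoding, and the assumption that neighboring weight vectors differ by at most 1 in $\ell_1$ norm are choices made here for the example.

```python
import random

def dasgupta_cost(tree, weights):
    """Dasgupta's cost of `tree` for the weighted graph `weights`.

    tree    : nested tuples; any non-tuple value is a leaf (a vertex).
    weights : dict mapping an edge (u, v) to its weight.
    """
    def visit(node):
        # A leaf covers only itself and contributes no cost.
        if not isinstance(node, tuple):
            return frozenset([node]), 0.0
        parts = [visit(child) for child in node]
        leaves = frozenset().union(*(p[0] for p in parts))
        cost = sum(p[1] for p in parts)
        # Map each vertex below this node to the child subtree holding it.
        side = {x: i for i, p in enumerate(parts) for x in p[0]}
        for (u, v), w in weights.items():
            # An edge is separated here iff this node is the lowest
            # common ancestor of its endpoints.
            if u in side and v in side and side[u] != side[v]:
                cost += w * len(leaves)
        return leaves, cost

    return visit(tree)[1]

def perturb_weights(weights, eps, rng=random):
    """Add Laplace(1/eps) noise to every edge weight (illustrative only).

    Assumes sensitivity 1, i.e. neighboring weight vectors differ by at
    most 1 in l1 norm. The difference of two i.i.d. exponentials with
    rate eps is Laplace-distributed with scale 1/eps.
    """
    return {e: w + rng.expovariate(eps) - rng.expovariate(eps)
            for e, w in weights.items()}

# Unit-weight path graph 0 - 1 - 2 - 3.
path = {(0, 1): 1, (1, 2): 1, (2, 3): 1}
balanced = ((0, 1), (2, 3))
caterpillar = (((0, 1), 2), 3)

print(dasgupta_cost(balanced, path))      # 2 + 2 + 4 = 8.0
print(dasgupta_cost(caterpillar, path))   # 2 + 3 + 4 = 9.0
noisy = perturb_weights(path, eps=1.0)
print(dasgupta_cost(balanced, noisy))     # noisy, varies per run
```

On this path graph the balanced hierarchy has the lower cost (8 versus 9), because the edge cut at the root is charged for all four leaves while edges merged early are charged only for small clusters.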
