Privacy-Preserving Vertical K-Means Clustering

Federico Mazzone, Trevor Brown, Florian Kerschbaum, Kevin H. Wilson, Maarten Everts, Florian Hahn, Andreas Peter

TL;DR

This work tackles privacy-preserving k-means clustering in vertically partitioned data by combining CKKS-based homomorphic encryption with differential privacy applied only to centroids. The method securely outsources features once and runs Lloyd’s algorithm on encrypted data, achieving a communication complexity of O(n+kt) and maintaining accuracy close to plaintext clustering. Key contributions include an optimized argmin encoding, packing of multiple argmins in a single ciphertext, padding-free replication, and extensions to higher dimensions and multiple parties, with strong empirical results showing dramatic WAN-friendly speedups over MPC baselines. The approach enables scalable, accurate clustering across millions of points in constrained networks, offering practical privacy guarantees for real-world cross-institutional analytics.

Abstract

Clustering is a fundamental data processing task used for grouping records based on one or more features. In the vertically partitioned setting, data is distributed among entities, with each holding only a subset of those features. A key challenge in this scenario is that computing distances between records requires access to all distributed features, which may be privacy-sensitive and cannot be directly shared with other parties. The goal is to compute the joint clusters while preserving the privacy of each entity's dataset. Existing solutions using secret sharing or garbled circuits implement privacy-preserving variants of Lloyd's algorithm but incur high communication costs, scaling as O(nkt), where n is the number of data points, k the number of clusters, and t the number of rounds. These methods become impractical for large datasets or several parties, limiting their use to LAN settings. A different line of solutions relies on differential privacy (DP) to outsource the local features of the parties to a central server. However, these solutions often significantly degrade the utility of the clustering outcome due to excessive noise. In this work, we propose a novel solution based on homomorphic encryption and DP, reducing communication complexity to O(n+kt). In our method, parties securely outsource their features once, allowing a computing party to perform clustering operations under encryption. DP is applied only to the clusters' centroids, ensuring privacy with minimal impact on utility. Our solution clusters 100,000 two-dimensional points into five clusters using only 73 MB of communication, compared to 101 GB for existing works, and completes in just under 3 minutes on a 100 Mbps network, whereas existing works take over 1 day. This makes our solution practical even for WAN deployments, all while maintaining accuracy comparable to plaintext k-means algorithms.
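
To make the vertically partitioned setting concrete, the following is a minimal plaintext sketch of Lloyd's algorithm in which each party holds a disjoint block of feature columns. It is not the paper's protocol: it omits the homomorphic encryption and the DP noise on centroids entirely, and the names `partial_sq_dists` and `vertical_lloyd`, the random initialization, and the default iteration count are assumptions made for illustration. It only shows why distances need every party's features: the squared Euclidean distance to a centroid is the sum of per-party partial distances.

```python
import numpy as np

def partial_sq_dists(features, centroid_part):
    """Squared distances using only one party's feature columns.

    features: (n, d_p) array, centroid_part: (k, d_p) array -> (n, k) array.
    """
    diff = features[:, None, :] - centroid_part[None, :, :]
    return (diff ** 2).sum(axis=2)

def vertical_lloyd(parts, k, iters=10, seed=0):
    """Plaintext Lloyd's algorithm over vertically partitioned data.

    parts: list of (n, d_p) arrays, one per party (hypothetical layout).
    Returns the last assignment and the concatenated centroids.
    """
    rng = np.random.default_rng(seed)
    n = parts[0].shape[0]
    # Assumption: initialize centroids from k distinct random records.
    init = rng.choice(n, size=k, replace=False)
    centroids = [p[init].astype(float) for p in parts]
    for _ in range(iters):
        # Full squared distance = sum of the parties' partial distances.
        dists = sum(partial_sq_dists(p, c) for p, c in zip(parts, centroids))
        labels = dists.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():  # keep the old centroid if a cluster is empty
                for c, p in zip(centroids, parts):
                    c[j] = p[mask].mean(axis=0)
    return labels, np.hstack(centroids)
```

In this sketch the summed distance matrix and the argmin are computed in the clear; in the setting described above, those are the steps that would have to run under encryption so that no party sees another party's features.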

Paper Structure

This paper contains 43 sections, 11 equations, 5 figures, 4 tables, 8 algorithms.

Figures (5)

  • Figure 1: Secure k-means clustering protocol steps.
  • Figure 2: Schematic example of optimized encoding for $k = 3$.
  • Figure 3: Schematic example of extracting and re-encoding multiple points from $X^B$ in one ciphertext. In this example we have $k = 3$ clusters, $18$ ciphertext slots, and we extract and re-encode the $6$th element of each $k \times k$ square.
  • Figure 4: Runtime of 10 iterations of our approach for increasing dimensionality. The number of points and clusters are fixed to $n = 100{,}000$ and $k = 5$, respectively. Experiments are conducted in the regWAN500 network environment.
  • Figure 5: K-Means loss and accuracy of our solution vs. Li, Wang, and Li on S1, for varying privacy budget. The dashed lines represent the plaintext baseline.