
Federated K-means Clustering

Swier Garst, Marcel Reinders

TL;DR

To address clustering across distributed datasets without sharing raw data, the paper proposes Federated K-Means (FKM), an unsupervised federated clustering method. FKM handles heterogeneity by allowing clients to have different local cluster counts and by performing a server-side, weighted alignment of local means to form a global clustering. It uses the modified objective $F_{km} = \sum_{j=0}^{M} \min_{C_i \in C_g}\left(S_j \lVert C_j - C_i \rVert^2\right)$, where $S_j$ is the sample count of local cluster $C_j$ and $C_g$ is the set of global cluster means. As a privacy safeguard, local clusters smaller than a threshold $p$ (the paper uses $p=2$) are not transmitted. Empirical results on synthetic 2D data and on FEMNIST show that FKM achieves performance close to centralized k-means and is more robust to heterogeneity than prior one-shot approaches, highlighting its potential for privacy-preserving distributed clustering.
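The objective above can be made concrete with a small sketch. The Python below is illustrative only, not the authors' implementation: the function names are hypothetical, "below a threshold $p$" is assumed to mean "fewer than $p$ samples", and the server-side update is a generic weighted Lloyd step (a size-weighted average of the local means assigned to each global mean), which is the standard way to decrease an objective of this form.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance ||a - b||^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filter_small_clusters(
    local_means: List[Point], sizes: List[int], p: int = 2
) -> Tuple[List[Point], List[int]]:
    """Keep only local clusters with at least p samples.

    Assumption: 'below a threshold p' is read as size < p.
    """
    kept = [(m, s) for m, s in zip(local_means, sizes) if s >= p]
    return [m for m, _ in kept], [s for _, s in kept]


def fkm_objective(
    local_means: List[Point], sizes: List[int], global_means: List[Point]
) -> float:
    """F_km: sum over local clusters of S_j times the squared distance
    from C_j to its nearest global mean."""
    return sum(
        s * min(sq_dist(c_j, c_i) for c_i in global_means)
        for c_j, s in zip(local_means, sizes)
    )


def weighted_update(
    local_means: List[Point], sizes: List[int], global_means: List[Point]
) -> List[Tuple[float, ...]]:
    """One weighted Lloyd step on the local means (generic, assumed)."""
    dim = len(global_means[0])
    sums = [[0.0] * dim for _ in global_means]
    weights = [0.0] * len(global_means)
    for c_j, s in zip(local_means, sizes):
        # Assign the local mean to its nearest global mean.
        i = min(range(len(global_means)),
                key=lambda k: sq_dist(c_j, global_means[k]))
        weights[i] += s
        for d in range(dim):
            sums[i][d] += s * c_j[d]
    # Global means with no assigned local mean are left unchanged.
    return [
        tuple(v / w for v in acc) if w > 0 else tuple(g)
        for acc, w, g in zip(sums, weights, global_means)
    ]


if __name__ == "__main__":
    means = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (5.0, 5.0)]
    sizes = [3, 1, 4, 1]
    global_means = [(0.0, 1.0), (10.0, 10.0)]
    kept_means, kept_sizes = filter_small_clusters(means, sizes, p=2)
    print(fkm_objective(kept_means, kept_sizes, global_means))
    print(weighted_update(kept_means, kept_sizes, global_means))
```

Weighting by $S_j$ means that a local mean summarizing many samples pulls a global mean more strongly than one summarizing few, so that clients with large local clusters are not under-represented in the global clustering.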

Abstract

Federated learning (FL) is a technique that enables the use of distributed datasets for machine learning purposes without requiring data to be pooled, thereby better preserving privacy and ownership of the data. While supervised FL research has grown substantially in recent years, unsupervised FL methods remain scarce. This work introduces an algorithm that implements K-means clustering in a federated manner, addressing the challenges of varying numbers of clusters across centers, as well as convergence on less separable datasets.

Paper Structure (9 sections, 3 equations, 7 figures, 1 table, 1 algorithm)


Figures (7)

  • Figure 1: The regular synthetic datasets. (a) shows the original sampling of the regular synthetic dataset, with the defined cluster means (from which the data are generated using a normal distribution N(0,1)) in red. (b) shows ARI results on all three datasets. (c) to (e) show how this dataset is distributed over five different clients using different values of $\beta$. Different colors indicate the different clients.
  • Figure 2: Some of the data distributions of the simulated datasets with increasing levels of noise (columns), using 50 or 200 points per cluster (rows).
  • Figure 3: Clustering results on the synthetic dataset when using different levels of noise for different values of $\beta$. (a) to (c) show the final ARI scores for $\beta$ = 0.1, 1, and 10, respectively. (d) to (f) show how the ARI score for FKM converges over time, each corresponding to the panel above it.
  • Figure 4: Assessment of the method on data with a large variability of local clusters per client. (a) shows the distribution per client, (b) the ARI results for different methods.
  • Figure 5: Results on (a subset of) FEMNIST. (a) shows the silhouette score, (b) the simplified silhouette score.
  • ...and 2 more figures