Federated K-means Clustering
Swier Garst, Marcel Reinders
TL;DR
To address clustering across distributed datasets without sharing raw data, the paper proposes Federated K-Means (FKM), an unsupervised federated clustering method. FKM handles heterogeneity by allowing clients to have different local cluster counts and by performing a server-side, weighted alignment of local means to form a global clustering. The server minimizes a modified objective $F_{km} = \sum_{j=0}^{M} \min_{C_i \in C_g} \left( S_j \|C_j - C_i\|^2 \right)$, where $C_j$ is a local cluster mean, $S_j$ is the sample count of that local cluster, and $C_g$ is the set of global cluster means. As a privacy safeguard, local clusters with fewer than $p$ samples are not transmitted (the paper uses $p=2$). Empirical results on synthetic 2D data and on FEMNIST show that FKM achieves performance close to centralized k-means and is more robust to heterogeneity than prior one-shot approaches, highlighting its potential for privacy-preserving distributed clustering.
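To make the weighted objective concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. The function names (`fkm_objective`, `drop_small_clusters`, `weighted_update`) are hypothetical. The update step assumes the server-side alignment behaves like one iteration of sample-weighted k-means over the received local means, and the filter assumes "below the threshold" means a sample count strictly less than $p$.

```python
import numpy as np


def _sq_dists(local_means, global_means):
    # Pairwise squared Euclidean distances, shape (n_local, n_global).
    diffs = local_means[:, None, :] - global_means[None, :, :]
    return np.sum(diffs ** 2, axis=2)


def fkm_objective(local_means, local_sizes, global_means):
    """F_km = sum_j min_i S_j * ||C_j - C_i||^2 over local clusters j."""
    local_means = np.asarray(local_means, dtype=float)
    local_sizes = np.asarray(local_sizes, dtype=float)
    global_means = np.asarray(global_means, dtype=float)
    nearest_sq = _sq_dists(local_means, global_means).min(axis=1)
    return float(np.sum(local_sizes * nearest_sq))


def drop_small_clusters(local_means, local_sizes, p=2):
    """Assumed privacy filter: keep only local clusters with at least p samples."""
    local_means = np.asarray(local_means, dtype=float)
    local_sizes = np.asarray(local_sizes, dtype=float)
    keep = local_sizes >= p
    return local_means[keep], local_sizes[keep]


def weighted_update(local_means, local_sizes, global_means):
    """One assumed server step: assign each local mean to its nearest global
    mean, then move each global mean to the size-weighted average of the
    local means assigned to it. Global means with no assignment are kept."""
    local_means = np.asarray(local_means, dtype=float)
    local_sizes = np.asarray(local_sizes, dtype=float)
    new_means = np.array(global_means, dtype=float)  # copy, do not mutate input
    nearest = _sq_dists(local_means, new_means).argmin(axis=1)
    for i in range(len(new_means)):
        mask = nearest == i
        if mask.any():
            new_means[i] = np.average(
                local_means[mask], axis=0, weights=local_sizes[mask]
            )
    return new_means


# Toy example: three local clusters reported by clients, two global means.
means = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
sizes = np.array([3, 1, 5])
global_means = np.array([[1.0, 0.0], [10.0, 0.0]])

print(fkm_objective(means, sizes, global_means))  # 3*1 + 1*1 + 5*0 = 4.0
print(weighted_update(means, sizes, global_means))
```

In the toy example, the first global mean moves from (1, 0) to (0.5, 0) because the local cluster with three samples pulls harder than the one with a single sample. With `p=2`, `drop_small_clusters` would withhold that single-sample cluster before anything is sent to the server.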
Abstract
Federated learning (FL) is a technique that enables the use of distributed datasets for machine learning purposes without requiring data to be pooled, thereby better preserving privacy and ownership of the data. While supervised FL research has grown substantially in recent years, unsupervised FL methods remain scarce. This work introduces an algorithm that implements K-means clustering in a federated manner, addressing the challenges of a varying number of clusters across centers, as well as convergence on less separable datasets.
