Jigsaw Game: Federated Clustering

Jinxuan Xu, Hong-You Chen, Wei-Lun Chao, Yuqian Zhang

TL;DR

The paper tackles federated clustering for unlabeled data by formulating a federated $k$-means objective $G(\mathcal{C})=\sum_m G_m(\mathcal{C})$ and introducing FeCA, a one-shot method that refines each client’s local centroids and aggregates them at the server to recover the global centroids $\mathcal{C}^*$. It exploits the structured nature of local solutions (one-to-many and many-to-one associations) under separation conditions, aided by a RadiusAssign/ServerAggregation pipeline and theoretical guarantees under the Stochastic Ball Model. The authors extend FeCA to DeepFeCA for federated unsupervised representation learning by iterating with DeepCluster-inspired pseudo-labeling, yielding competitive results on CIFAR and Tiny-ImageNet in federated settings. Empirical results across synthetic and real datasets show FeCA’s robustness to non-IID data and its ability to recover global centroids in a single round, often surpassing centralized baselines due to leveraging diverse local solutions. Overall, the approach offers a strong, communication-efficient framework for federated unsupervised learning with practical impact on privacy-preserving clustering and representation learning.
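
The decomposition $G(\mathcal{C})=\sum_m G_m(\mathcal{C})$ can be checked numerically. The sketch below assumes each $G_m$ is the unnormalized sum of squared distances from client $m$'s points to their nearest centroid; the helper names (`sq_dist`, `kmeans_cost`, `federated_cost`) and the toy data are illustrative and not taken from the paper.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_cost(points, centroids):
    """k-means cost: each point contributes its squared distance to the nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def federated_cost(client_data, centroids):
    """G(C) = sum over clients m of G_m(C), with G_m the local k-means cost (assumed form)."""
    return sum(kmeans_cost(points, centroids) for points in client_data)

random.seed(0)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
# 50 noisy points around each center, stored cluster by cluster.
data = [(cx + random.gauss(0, 0.3), cy + random.gauss(0, 0.3))
        for cx, cy in centers for _ in range(50)]
# Hypothetical extreme non-IID split: each client holds exactly one cluster.
clients = [data[0:50], data[50:100], data[100:150]]

pooled = kmeans_cost(data, centers)
federated = federated_cost(clients, centers)
```

Because the clients partition the pooled dataset, `pooled` and `federated` agree up to floating-point rounding for any choice of centroids, regardless of how skewed the split is. The difficulty in the federated setting is therefore not the objective itself but that each client can only minimize its own $G_m$.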

Abstract

Federated learning has recently garnered significant attention, especially within the domain of supervised learning. However, despite the abundance of unlabeled data on end-users, unsupervised learning problems such as clustering in the federated setting remain underexplored. In this paper, we investigate the federated clustering problem, with a focus on federated k-means. We outline the challenge posed by its non-convex objective and data heterogeneity in the federated framework. To tackle these challenges, we adopt a new perspective by studying the structures of local solutions in k-means and propose a one-shot algorithm called FeCA (Federated Centroid Aggregation). FeCA adaptively refines local solutions on clients, then aggregates these refined solutions to recover the global solution of the entire dataset in a single round. We empirically demonstrate the robustness of FeCA under various federated scenarios on both synthetic and real-world data. Additionally, we extend FeCA to representation learning and present DeepFeCA, which combines DeepCluster and FeCA for unsupervised feature learning in the federated setting.

Paper Structure

This paper contains 34 sections, 4 theorems, 49 equations, 18 figures, 10 tables, 6 algorithms.

Key Result

Theorem 4.1

(Main Theorem) Under the Stochastic Ball Model, for some constants $\lambda\geq 3$ and $\eta \geq 5$, if […] then, by utilizing the radius determined by Algorithm FeCA-RadiusAssign2, any output centroid $c_s^*$ from Algorithm FeCA is close to some ground truth center: […]
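
To make the role of the radius concrete, here is a minimal sketch of radius-based grouping on the server. It is not the paper's ServerAggregation or RadiusAssign procedure: the greedy seed order (largest radius first), the hand-picked radii, and the function name `aggregate` are all assumptions made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def aggregate(centroids, radii):
    """Greedily group received centroids and average each group.

    Assumed rule: pick the ungrouped centroid with the largest radius as a seed,
    collect every ungrouped centroid within that radius, and replace the group
    by its mean. Repeat until no centroid is left.
    """
    remaining = list(range(len(centroids)))
    merged = []
    while remaining:
        seed = max(remaining, key=lambda i: radii[i])
        group = [i for i in remaining
                 if sq_dist(centroids[i], centroids[seed]) <= radii[seed] ** 2]
        dim = len(centroids[seed])
        merged.append(tuple(sum(centroids[i][d] for i in group) / len(group)
                            for d in range(dim)))
        remaining = [i for i in remaining if i not in group]
    return merged

# Hypothetical centroids reported by three clients, scattered around
# the true centers (0, 0), (5, 5), and (0, 5).
received = [(0.1, -0.1), (4.9, 5.2), (0.0, 5.1),
            (-0.2, 0.1), (5.1, 4.9), (0.1, 4.8),
            (0.1, 0.0), (5.0, 4.9), (-0.1, 5.1)]
radii = [1.0] * len(received)  # hand-picked here, not computed by RadiusAssign

recovered = aggregate(received, radii)
```

With well-separated clusters and a radius smaller than half the separation, each group contains only centroids that estimate the same true center, so the averaged outputs land near those centers. If the radius is too large, distinct clusters get merged; if it is too small, one cluster yields several outputs.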

Figures (18)

  • Figure 1: Clustering results. (Left): global and local solutions on centralized/IID clients' data; (Right): global solutions for non-IID clients' data sharing similar structures.
  • Figure 2: FeCA roadmap. 1st column: The centralized dataset distributed to clients. 2nd column: The $k$-means clustering results on different clients under the non-IID data sample scenario, where black triangles and squares represent centroids. 3rd column: Eliminating one-fit-many centroids in Algorithm FeCA-ClientUpdate, indicated by hollow squares and triangles. 4th column: Centroids sent to the server. 5th column: Aggregation of received centroids on the server, where red crosses represent recovered centroids.
  • Figure 3: Illustrations of $\ell_2$-distance results in the synthetic experiments (exp:synthetic_l2) with 10 random seeds.
  • Figure 4: Visualizations of S-sets (S1&S4) and recovered centroids by different methods. Results are showcased under the Dirichlet($0.3$) data sample scenario. Blue dots represent recovered centroids, and red crosses indicate the ground truth centers.
  • Figure 5: Evaluation of $\sigma$ on S-sets (S1) across three data sample scenarios. $\sigma_i$ for $k$ clusters is represented in different colors. The values of $\sigma_i$ for all returned centroids $c_i$ are reported over $3$ random runs, with the red star marking the maximum $\sigma_i$ observed across the three runs. A $\sigma_i$ value below $0.5$ indicates that the server effectively groups centroid $c_i$ using the radius $r_s$ assigned by Algorithm FeCA-RadiusAssign2-empirical.
  • ...and 13 more figures

Theorems & Definitions (9)

  • Theorem 4.1
  • Remark 6.1
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • Lemma A.3
  • proof
  • proof