Federated Clustering: An Unsupervised Cluster-Wise Training for Decentralized Data Distributions

Mirko Nardi; Lorenzo Valerio; Andrea Passarella

TL;DR

FedCRef addresses unsupervised federated clustering under decentralized, non-IID data with an unknown global cluster count $K_G$. It combines cluster-wise local representation learning, cross-client model exchange based on reconstruction error, and graph-based federated grouping with iterative refinement and stopping criteria. The method identifies all underlying distributions without labels and aligns local clusters to global categories, achieving up to $ACC \approx 0.95$ on multiple datasets while remaining scalable and lightweight for edge devices. This work enables privacy-preserving discovery of data distributions across distributed systems and offers robust performance under noisy initializations and varying data overlap.

Abstract

Federated Learning (FL) enables decentralized machine learning while preserving data privacy, making it ideal for sensitive applications where data cannot be shared. While FL has been widely studied in supervised contexts, its application to unsupervised learning remains underdeveloped. This work introduces FedCRef, a novel unsupervised federated learning method designed to uncover all underlying data distributions across decentralized clients without requiring labels. This task, known as Federated Clustering, presents challenges due to heterogeneous, non-uniform data distributions and the lack of centralized coordination. Unlike previous methods that assume a one-cluster-per-client setup or require prior knowledge of the number of clusters, FedCRef generalizes to multi-cluster-per-client scenarios. Clients iteratively refine their data partitions while discovering all distinct distributions in the system. The process combines local clustering, model exchange and evaluation via reconstruction error analysis, and collaborative refinement within federated groups of similar distributions to enhance clustering accuracy. Extensive evaluations on four public datasets (EMNIST, KMNIST, Fashion-MNIST and KMNIST49) show that FedCRef successfully identifies true global data distributions, achieving an average local accuracy of up to 95%. The method is also robust to noisy conditions, scalable, and lightweight, making it suitable for resource-constrained edge devices.
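The model exchange and reconstruction-error evaluation mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the stand-in "representation model" is just a cluster centroid (the abstract does not specify the model used here), the function names are hypothetical, and the `tolerance` value is arbitrary. It only shows the general shape of the three steps named in the outline: compute reconstruction errors, measure the error difference, and apply a threshold test.

```python
from statistics import mean


def fit_centroid_model(samples):
    """Hypothetical stand-in for a representation model: the per-dimension mean."""
    dim = len(samples[0])
    return [mean(s[d] for s in samples) for d in range(dim)]


def reconstruction_error(model, samples):
    """Mean squared error when every sample is 'reconstructed' as the centroid."""
    return mean(sum((x - c) ** 2 for x, c in zip(s, model)) for s in samples)


def accepts(own_model, received_model, own_samples, tolerance):
    """Threshold test: the received model may be at most `tolerance` worse
    than the locally trained model on the local cluster (assumed criterion)."""
    own_err = reconstruction_error(own_model, own_samples)
    received_err = reconstruction_error(received_model, own_samples)
    return (received_err - own_err) <= tolerance


def mutually_associated(cluster_i, cluster_j, tolerance=0.1):
    """Two clusters are associated only if each side accepts the other's model."""
    model_i = fit_centroid_model(cluster_i)
    model_j = fit_centroid_model(cluster_j)
    return (accepts(model_i, model_j, cluster_i, tolerance)
            and accepts(model_j, model_i, cluster_j, tolerance))


# Toy data: a and b overlap, c lies far away.
cluster_a = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2]]
cluster_b = [[0.1, 0.1], [0.1, 0.3], [0.3, 0.1]]
cluster_c = [[5.0, 5.0], [5.2, 5.0], [5.0, 5.2]]

print(mutually_associated(cluster_a, cluster_b))  # True
print(mutually_associated(cluster_a, cluster_c))  # False
```

Only models cross the client boundary in this sketch; each client evaluates a received model on its own samples, which is consistent with the privacy-preserving setting the abstract describes.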

Paper Structure (30 sections, 8 equations, 10 figures, 12 tables, 2 algorithms)

Introduction
Related work
Problem Statement and System Assumptions
Objective 1:
Objective 2:
Objective 3:
Methodology
Step 1: Compute reconstruction errors:
Step 2: Measure error differences:
Step 3: Apply threshold test:
Step 1: Models Evaluation
Step 2: Best Model Selection
Step 3: Cluster Assignment
Step 4: Iterative Refinement for $K_i$ Clusters
Step 5: Final Assignment of Remaining Samples
...and 15 more sections

Figures (10)

Figure 1: Schematic representation of the considered federated learning system with $N=4$ clients. The objective is to identify the set of unique global data distributions $U$ (global $K_G=4$) represented by distinct shapes. Within each client $C_i$, $K_i$ denotes the number of unique data distributions present in the local dataset. The dotted lines indicate a possible clustering split $Q_i$, highlighting how imperfect local clusters may be formed within each client’s dataset.
Figure 2: Illustration of Step 1 for client $C_i$ with two clusters ($K_i=2$). The dotted line indicates the initial clustering split of the local dataset into $Q_{i1}$ and $Q_{i2}$, which may be imperfectly aligned to the true categories. In this step, a representation model $M$ is trained locally for each cluster.
Figure 3: Diagram illustrating the model exchange mechanism: Client $C_i$ assesses the reconstruction effectiveness of a model $M_{jp}$ received from Client $C_j$, and vice versa, to establish if clusters $Q_{iq}$ and $Q_{jp}$ indicate similar underlying distributions.
Figure 4: Diagram illustrating the identification and training of federated groups. In the example, $G_p$ consists of three associated clusters, and a corresponding federated model $M_{G_p}$ is trained using the samples involved.
Figure 5: Illustration of cluster refinement for client $C_i$: The client receives all the trained federated models and uses a subroutine to refine its local clustering ($K_i=2$), i.e., to achieve better alignment with the true categories.
...and 5 more figures
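
The federated groups shown in Figure 4 can be pictured as sets of associated clusters in a graph. The sketch below is one plausible reading, not the paper's stated procedure: it assumes that groups are the connected components of an undirected association graph, and the names `federated_groups`, `cluster_ids`, and `associations` are hypothetical. Clusters are identified by (client, local cluster index) pairs.

```python
from collections import defaultdict


def federated_groups(cluster_ids, associations):
    """Group clusters into connected components of the association graph.

    cluster_ids:  iterable of hashable cluster identifiers.
    associations: iterable of (cluster_a, cluster_b) pairs judged similar.
    """
    adjacency = defaultdict(set)
    for a, b in associations:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    groups = []
    for start in cluster_ids:
        if start in seen:
            continue
        # Depth-first traversal collects every cluster reachable from `start`.
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(component))
    return groups


# Three clients; cluster ("C1", 2) has no associated cluster elsewhere.
clusters = [("C1", 1), ("C1", 2), ("C2", 1), ("C3", 1)]
links = [(("C1", 1), ("C2", 1)), (("C2", 1), ("C3", 1))]

groups = federated_groups(clusters, links)
print(groups)
# [[('C1', 1), ('C2', 1), ('C3', 1)], [('C1', 2)]]
```

Under this assumption, the number of groups gives an estimate of the global cluster count $K_G$, and each group with more than one member would train one federated model $M_{G_p}$ on the samples of its clusters.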

Theorems & Definitions (2)

Definition 4.1: Wrong Association
Definition 4.2: Missed Association
