Fair Federated Data Clustering through Personalization: Bridging the Gap between Diverse Data Distributions

Shivam Gupta, Tarushi, Tsering Wangzes, Shweta Jain

TL;DR

This work tackles unsupervised federated clustering with unlabeled edge data by introducing p-FClus, a three-phase, single-round personalization framework. It initializes locally, aggregates centers globally, and personalizes centers at the client level through a gradient-based objective that couples clustering cost with a regularization term to stay close to local centers. The approach yields lower mean per-point cost, reduced cost variance across clients, and lower maximum cost compared to state-of-the-art baselines across diverse datasets and data distributions, including real, synthetic, balanced, and intrinsic federated settings, for both $k$-means ($\ell=2$) and $k$-medoids ($\ell=1$). Its data-distribution-independence and one-round communication make it practically impactful for privacy-preserving, edge-based clustering, with future work addressing robustness to malicious participants and dynamic client participation.
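
The personalization phase described above can be illustrated with a minimal sketch. This is not the authors' algorithm: the exact objective, step size, and center matching used by p-FClus are not given here. The sketch assumes a squared Euclidean clustering cost averaged over a client's points, a quadratic penalty with weight `lam` pulling each center toward a reference center, and that global center `j` is already matched to local center `j`. The function name `personalize_centers` and all of its parameters are hypothetical.

```python
import numpy as np

def personalize_centers(points, global_centers, local_centers,
                        lam=1.0, lr=0.05, steps=200):
    """Gradient descent on an assumed personalization objective:

        mean_x ||x - c_{a(x)}||^2  +  lam * sum_j ||c_j - local_j||^2

    where a(x) is the index of the center nearest to x. Starts from the
    global centers and is regularized toward the client's local centers.
    """
    centers = np.asarray(global_centers, dtype=float).copy()
    anchors = np.asarray(local_centers, dtype=float)
    points = np.asarray(points, dtype=float)
    n = len(points)
    for _ in range(steps):
        # Squared Euclidean distance from every point to every center: (n, k)
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Gradient of the regularization term
        grad = 2.0 * lam * (centers - anchors)
        # Gradient of the clustering term, one center at a time
        for j in range(len(centers)):
            members = points[assign == j]
            if len(members):
                grad[j] += 2.0 * (len(members) * centers[j] - members.sum(axis=0)) / n
        centers -= lr * grad
    return centers
```

With `lam=0` each center drifts to the mean of the points assigned to it; larger values of `lam` keep the personalized centers closer to the reference centers, which is the trade-off the regularization term is meant to control.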

Abstract

The rapid growth of data from edge devices has catalyzed the performance of machine learning algorithms. However, the generated data resides on client devices, so traditional machine learning paradigms face two major challenges: the data must be centralized for training, and most of the generated data lacks class labels, with clients having little incentive to label it manually owing to the high cost and lack of expertise. To overcome these issues, there have been initial attempts to handle unlabelled data in a privacy-preserving, distributed manner using unsupervised federated data clustering. The goal is to partition the data available on clients into $k$ partitions (called clusters) without any actual exchange of data. Most existing algorithms are either highly dependent on data distribution patterns across clients or computationally expensive. Furthermore, because data is skewed across clients in most practical scenarios, existing models may leave some clients with a high clustering cost, making them reluctant to participate in the federated process. To address this, we are the first to introduce the idea of personalization in federated clustering. The goal is to strike a balance between achieving a lower clustering cost and achieving a uniform cost across clients. We propose p-FClus, which addresses these goals in a single round of communication between server and clients. We validate the efficacy of p-FClus on a variety of federated datasets, showcasing its data-independent nature and its applicability to any finite $\ell$-norm, while simultaneously achieving lower cost and variance.
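
To make the notions of "lower cost" and "uniform cost across clients" concrete, the following sketch computes a mean per-point cost for each client and summarizes it by mean, variance, and maximum across clients. It is an illustration under stated assumptions, not the paper's evaluation code: the cost of a point is assumed to be its $\ell$-norm distance to the nearest center raised to the power $\ell$ (squared Euclidean for $\ell=2$, Manhattan for $\ell=1$), and the function names are hypothetical.

```python
import numpy as np

def per_point_cost(points, centers, ell=2):
    """Mean cost of assigning each point to its nearest center.

    Assumption: cost of a point = (ell-norm distance to nearest center) ** ell.
    """
    diffs = points[:, None, :] - centers[None, :, :]   # shape (n, k, d)
    dists = np.linalg.norm(diffs, ord=ell, axis=2)     # shape (n, k)
    return float((dists.min(axis=1) ** ell).mean())

def fairness_summary(client_data, client_centers, ell=2):
    """Summarize per-client costs by mean, variance, and maximum."""
    costs = np.array([per_point_cost(X, C, ell)
                      for X, C in zip(client_data, client_centers)])
    return {"mean": float(costs.mean()),
            "variance": float(costs.var()),
            "max": float(costs.max())}
```

A low variance and a low maximum indicate that no single client bears a disproportionately high clustering cost, which is the fairness notion the abstract points to.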

Paper Structure (25 sections, 9 equations, 6 figures, 4 tables, 4 algorithms)

This paper contains 25 sections, 9 equations, 6 figures, 4 tables, 4 algorithms.

Figures (6)

  • Figure 1: The plot shows the variation in evaluation metrics against proposed $\texttt{p-FClus}$ and $\texttt{SOTA}$ on $k$-means for varying heterogeneity levels on a Balanced data split across $100$ clients. Each column represents a dataset as specified at the top, and each row represents one metric under evaluation. Note that the FMNIST dataset is on $500$ clients. (Best viewed in color).
  • Figure 2: The plot shows the variation in evaluation metrics against proposed $\texttt{p-FClus}$ and $\texttt{SOTA}$ on $k$-means objective for varying heterogeneity levels on a Balanced data split across $1000$ clients. Each column represents a specific dataset as specified at the top, and each row represents one metric under evaluation. (Best viewed in color).
  • Figure 3: The plot shows the variation in evaluation metrics against proposed $\texttt{p-FClus}$ and $\texttt{SOTA}$ on $k$-means objective for varying heterogeneity levels on a Balanced data split across $1000$ clients. Each column represents a specific Synthetic dataset (Syn) in sequence: Syn-NO, Syn-LO, Syn-O respectively, and each row represents one metric under evaluation. (Best viewed in color).
  • Figure 4: The plot shows the variation in evaluation metrics against proposed $\texttt{p-FClus}$ and $\texttt{SOTA}$ on $k$-means objective for varying heterogeneity levels on an Unequal data split across $100$ clients. Each column represents a specific dataset as specified at the top, and each row represents one metric under evaluation. (Best viewed in color).
  • Figure 5: The plot shows the variation in evaluation metrics against proposed $\texttt{p-FClus}$ and $\texttt{SOTA}$ on $k$-means objective for varying heterogeneity levels on an Unequal data split across $500$ clients. Each column represents a specific dataset as specified at the top, and each row represents one metric under evaluation. (Best viewed in color).
  • ...and 1 more figure

Theorems & Definitions (3)

  • Definition 1: Heterogeneity
  • Definition 2: Objective Cost
  • Definition 3: Fair Federated Clustering