Federated Hierarchical Clustering with Automatic Selection of Optimal Cluster Numbers

Yue Zhang, Chuanlong Qiu, Xinfa Liao, Yiqun Zhang

Abstract

Federated Clustering (FC) is an emerging and promising solution for exploring data distribution patterns in distributed, privacy-protected data in an unsupervised manner. Existing FC methods implicitly assume that clients hold a known number of uniformly sized clusters. However, the true number of clusters is typically unknown, and cluster sizes are naturally imbalanced in real scenarios. Furthermore, the privacy-preserving transmission constraints in federated learning inevitably reduce the usable information, making the development of robust and accurate FC extremely challenging. Accordingly, we propose a novel FC framework named Fed-$k^*$-HC, which automatically determines an optimal number of clusters $k^*$ based on the data distribution explored through hierarchical clustering. To obtain the global data distribution for $k^*$ determination, each client generates micro-subclusters, whose prototypes are then uploaded to the server for hierarchical merging. The density-based merging design allows clusters of varying sizes and shapes to be explored, and the progressive merging process self-terminates according to the neighboring relationships among the prototypes, thereby determining $k^*$. Extensive experiments on diverse datasets demonstrate the capability of the proposed Fed-$k^*$-HC to accurately identify a proper number of clusters.

Paper Structure (18 sections, 26 equations, 7 figures, 7 tables, 3 algorithms)

Figures (7)

  • Figure 1: Overview of the proposed Fed-$k^*$-HC framework. Each client first applies the SNP algorithm to partition its local data (black dots) into multiple micro-subclusters (colored dots) and computes the centroid of each subcluster (black stars). These centroids are then uploaded to the server. The server aggregates all client subclusters (colored dots), reinitializes global centroids (colored stars), and performs clustering using corresponding neighborhood-based algorithms. The number of clusters $k^*$ is automatically estimated via a neighborhood-based method, without requiring manual specification.
  • Figure 2: A simplified example illustrating the construction of SNN pairs. The nearest neighbor (NN) structures of several sample points are visualized under different neighborhood sizes. In the SNN definition, a pair of points is considered an SNN pair only if the two points are mutual nearest neighbors, i.e., each lies in the $b$-nearest neighbor list of the other. In the diagram, such mutually connected pairs are joined by solid lines. For example, under $m$ = 2, the neighbors are as follows: A → {B, C}, B → {A, D}, C → {A, E}, D → {A, B}, E → {A, C}, F → {G, H}, G → {F, H}, H → {I, J}, I → {H, J}, J → {H, I}. Only pairs that appear in both nodes’ neighbor lists qualify as SNN pairs. For instance, since A and B each list the other as a neighbor, the pair (A, B) is an SNN pair according to Definition 2. In contrast, although D has A as a neighbor, A does not have D as a neighbor, so (A, D) is not an SNN pair. Accordingly, the identified SNN pairs under $m$ = 2 are: (A, B), (A, C), (B, D), (C, E), (F, G), (H, I), (H, J), (I, J), which form three clusters $C_{1}, C_{2}, C_{3}$.
  • Figure 3: p-values of the Wilcoxon signed-rank test comparing our method against the other methods on five metrics.
  • Figure 4: Parameter sensitivity analysis on $t$. The symbol $K$ denotes the ground-truth number of clusters in each dataset. The plotted results show how the proposed method can automatically infer the correct cluster number across datasets.
  • Figure 5: The variation curve of $P$ as $b$ changes on different datasets, showing the impact of the $b$-nearest neighbor parameter ($b$) used in the SNC algorithm on the proportion of short-distance point pairs ($P$). $P$ follows a similar trend as $b$ increases across all datasets, which aligns with our expectations.
  • ...and 2 more figures
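
The SNN example in the Figure 2 caption can be checked with a short script. The sketch below is illustrative only and is not the authors' implementation: the neighbor lists are copied from the caption (neighborhood size 2), while the function names `snn_pairs` and `connected_components` are hypothetical, and treating clusters as connected components of the SNN graph is an assumption made here to reproduce the three clusters stated in the caption.

```python
from itertools import combinations

# Neighbor lists copied from the Figure 2 caption (neighborhood size 2).
neighbors = {
    "A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "E"},
    "D": {"A", "B"}, "E": {"A", "C"}, "F": {"G", "H"},
    "G": {"F", "H"}, "H": {"I", "J"}, "I": {"H", "J"},
    "J": {"H", "I"},
}


def snn_pairs(neighbors):
    """Return pairs (u, v) in which each point is in the other's neighbor list."""
    return [
        (u, v)
        for u, v in combinations(sorted(neighbors), 2)
        if v in neighbors[u] and u in neighbors[v]
    ]


def connected_components(nodes, pairs):
    """Group nodes linked by SNN pairs (assumed cluster rule) via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in pairs:
        parent[find(u)] = find(v)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


pairs = snn_pairs(neighbors)
clusters = connected_components(neighbors, pairs)
print(pairs)
print(clusters)
```

Under these assumptions, the script yields the eight SNN pairs listed in the caption and groups the points into {A, B, C, D, E}, {F, G}, and {H, I, J}. Note that (G, H) is not a pair because H does not list G, which is why F and G stay separate from H, I, and J.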

Theorems & Definitions (1)

  • Remark 1