Distributed clustering in partially overlapping feature spaces

Alessio Maritan, Luca Schenato

TL;DR

This work addresses distributed clustering when participants observe only partially overlapping feature spaces, formalizing a star-topology setting with a central server and feature masks. It proposes two algorithms: a federated clustering method that iteratively updates $K$ global centroids and a one-shot method that uses locally fitted cluster distributions to generate synthetic proxy datasets for global merging. The authors provide convergence-related arguments under feasibility assumptions, analyze computational costs, and establish proxy-data quality bounds via $W_1$ and total-variation distances, complemented by numerical experiments on three public datasets. The results show that both approaches achieve clustering quality close to centralized baselines while preserving privacy and reducing communication, highlighting practical applicability in domains like healthcare with distributed, heterogeneous data. Overall, the paper advances privacy-preserving distributed clustering for partially observable features and lays groundwork for future fully distributed or kernel-based extensions.

Abstract

We introduce and address a novel distributed clustering problem where each participant has a private dataset containing only a subset of all available features, and some features are included in multiple datasets. This scenario occurs in many real-world applications, such as in healthcare, where different institutions have complementary data on similar patients. We propose two different algorithms suitable for solving distributed clustering problems that exhibit this type of feature space heterogeneity. The first is a federated algorithm in which participants collaboratively update a set of global centroids. The second is a one-shot algorithm in which participants share a statistical parametrization of their local clusters with the central server, who generates and merges synthetic proxy datasets. In both cases, participants perform local clustering using algorithms of their choice, which provides flexibility and personalized computational costs. Pretending that local datasets result from splitting and masking an initial centralized dataset, we identify some conditions under which the proposed algorithms are expected to converge to the optimal centralized solution. Finally, we test the practical performance of the algorithms on three public datasets.
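The first half of the one-shot idea (participants share a statistical parametrization of their local clusters, the server samples synthetic proxy data) can be illustrated with a minimal sketch. It assumes each local cluster is summarized by a Gaussian (mean, covariance, size); the abstract does not specify the parametrization, and the function names `summarize_local_clusters` and `generate_proxy` are hypothetical. The merging step across partially overlapping feature spaces is not shown.

```python
import numpy as np

def summarize_local_clusters(X, labels):
    """Participant side: one (mean, covariance, count) triple per local cluster.

    X has shape (n_points, n_local_features); labels may come from any local
    clustering algorithm. The Gaussian summary is an assumption of this sketch.
    Each cluster needs at least two points for the covariance to be defined.
    """
    summaries = []
    for k in np.unique(labels):
        pts = X[labels == k]
        summaries.append((pts.mean(axis=0), np.cov(pts, rowvar=False), len(pts)))
    return summaries

def generate_proxy(summaries, rng):
    """Server side: sample a synthetic proxy dataset from the shared summaries."""
    blocks = [rng.multivariate_normal(mean, cov, size=n) for mean, cov, n in summaries]
    return np.vstack(blocks)

# Toy participant with two well-separated local clusters in two features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(8.0, 1.0, size=(200, 2))])
labels = np.repeat([0, 1], 200)

summaries = summarize_local_clusters(X, labels)   # this is all that leaves the participant
proxy = generate_proxy(summaries, rng)            # same shape as X, but synthetic
```

Only the summaries are communicated in this sketch, so the raw points stay local and a single round of communication suffices.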

Paper Structure

This paper contains 31 sections, 29 equations, 7 figures, 5 tables, 4 algorithms.

Figures (7)

  • Figure 1: We consider the scenario in which participants collect similar data, but each participant observes only a fixed subset of features and the sets of features observed by different participants partially overlap. In the figure, $X_1, X_2, X_3$ are the local datasets of three participants, where each column is a data point and the features observed by the participants are indicated using colors.
  • Figure 2: In order to analyze the proposed algorithms, we pretend that the local datasets were generated by splitting a fictitious central dataset $\bar{X}$ and masking each dataset $\bar{X}_i$ with a diagonal, binary, projection matrix $\Omega_i$.
  • Figure 3: Example of computation of optimal cluster centroid, see \ref{eq:optimal_masked_centroids}.
  • Figure 4: Situation in which the combination of masking and local bias due to heterogeneous data distributions prevents the merging of local clusters associated with the same true cluster. The colored entries in the column vectors on the left represent the feature spaces of the participants. Due to masking, the local solution of participant $j$ can only be compared with that of participant $l$, but due to bias, their local clusters are too different to be merged. The similarity between the local cluster of participant $j$ and the ones of participants $i$ and $k$ cannot be detected due to masking.
  • Figure 5: In the figure, data points with different shapes belong to different true global clusters. Different colors indicate different local clusters, and the corresponding centroids are shown in black. Both local clusters and true global clusters are well described by their centroids. In this example, aggregating the two centroids at minimum distance (green dots and purple triangles) results in wrong clustering. This is due to the fact that local data belonging to the same true cluster have very different distributions.
  • ...and 2 more figures
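
The masking model of Figure 2 and a centroid computation in the spirit of Figure 3 can be sketched as follows. This is an illustration under stated assumptions, not the paper's equation: the feature split, the sizes `d` and `n`, and the helper `masked_centroid` are hypothetical, and the centroid is taken to be the minimizer of $\sum_i \sum_{x \in \bar{X}_i} \|\Omega_i (x - c)\|^2$, which reduces to a per-feature average over the points that observe that feature.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6
X_bar = rng.normal(size=(d, n))   # fictitious central dataset, one column per data point

# Hypothetical split: participant 0 observes features {0, 1, 2},
# participant 1 observes features {1, 2, 3}, so features 1 and 2 overlap.
masks = [np.diag([1.0, 1.0, 1.0, 0.0]), np.diag([0.0, 1.0, 1.0, 1.0])]
parts = [X_bar[:, :3], X_bar[:, 3:]]
X_local = [Omega @ Xi for Omega, Xi in zip(masks, parts)]   # unobserved rows become zero

def masked_centroid(X_local, masks):
    """Per-feature mean over the points whose mask observes that feature.

    Assumes every feature is observed by at least one participant;
    otherwise the denominator is zero for that feature.
    """
    num = sum(X.sum(axis=1) for X in X_local)                      # masked entries add zero
    den = sum(np.diag(Om) * X.shape[1] for Om, X in zip(masks, X_local))
    return num / den

c = masked_centroid(X_local, masks)
```

For the overlapping features the result equals the mean over all six points, while for features 0 and 3 it uses only the three points of the participant that observes them.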

Theorems & Definitions (2)

  • Remark 1
  • Remark 2