
VertCoHiRF: Decentralized Vertical Clustering Beyond k-means

Bruno Belucci, Karim Lounici, Vladimir R. Kostic, Katia Meziani

TL;DR

VertCoHiRF introduces a fully decentralized vertical clustering framework that avoids sharing feature data and relies on structural consensus across heterogeneous local views. By exchanging only cluster codes and ordinal rankings, agents iteratively build a Cluster Fusion Hierarchy (CFH) through a veto-based two-phase protocol that selects representative medoids and shrinks the problem size. The method provides identifier-level privacy guarantees and robustness to Byzantine behavior, with theoretical bounds on communication complexity and empirical evidence showing competitive clustering performance across synthetic and real-world VFL scenarios. This structure-aware, privacy-preserving approach enables flexible, multi-view clustering beyond k-means and offers interpretable cross-view clustering through CFH. The work has practical impact for privacy-conscious, distributed data collaborations where feature spaces are fragmented and heterogeneous.

Abstract

Vertical Federated Learning (VFL) enables collaborative analysis across parties holding complementary feature views of the same samples, yet existing approaches are largely restricted to distributed variants of $k$-means, requiring centralized coordination or the exchange of feature-dependent numerical statistics, and exhibiting limited robustness under heterogeneous views or adversarial behavior. We introduce VertCoHiRF, a fully decentralized framework for vertical federated clustering based on structural consensus across heterogeneous views, allowing each agent to apply a base clustering method adapted to its local feature space in a peer-to-peer manner. Rather than exchanging feature-dependent statistics or relying on noise injection for privacy, agents cluster their local views independently and reconcile their proposals through identifier-level consensus. Consensus is achieved via decentralized ordinal ranking to select representative medoids, progressively inducing a shared hierarchical clustering across agents. Communication is limited to sample identifiers, cluster labels, and ordinal rankings, providing privacy by design while supporting overlapping feature partitions and heterogeneous local clustering methods, and yielding an interpretable shared Cluster Fusion Hierarchy (CFH) that captures cross-view agreement at multiple resolutions. We analyze communication complexity and robustness, and experiments demonstrate competitive clustering performance in vertical federated settings.
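
To make the idea of identifier-level consensus concrete, here is a minimal Python sketch of one consensus round under simplifying assumptions. It is not the paper's veto-based two-phase protocol. It assumes that samples are grouped only when every agent assigns them to the same local cluster, and that each group's medoid is the sample with the lowest summed ordinal rank across agents (a Borda-style rule chosen here purely for illustration). The function names `consensus_groups`, `local_ranking`, and `select_medoid` and the toy data are hypothetical.

```python
from collections import defaultdict


def consensus_groups(labels_per_agent):
    """Group sample ids that share the same cluster label at every agent.

    labels_per_agent: list of dicts {sample_id: local_cluster_label}.
    A single disagreeing agent is enough to split two samples apart.
    """
    groups = defaultdict(list)
    for sid in sorted(labels_per_agent[0]):
        # The "cluster code" of a sample is its tuple of local labels.
        code = tuple(labels[sid] for labels in labels_per_agent)
        groups[code].append(sid)
    return list(groups.values())


def local_ranking(group, features):
    """Agent-side step: rank group members by centrality in the local view.

    Rank 0 goes to the sample with the smallest summed squared distance
    to the other members. Only the ranks leave the agent, not the features.
    """
    def cost(i):
        return sum(
            sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            for j in group
        )

    ordered = sorted(group, key=lambda i: (cost(i), i))
    return {sid: rank for rank, sid in enumerate(ordered)}


def select_medoid(group, rankings):
    """Pick the sample with the lowest summed rank (assumed Borda-style rule)."""
    return min(group, key=lambda sid: (sum(r[sid] for r in rankings), sid))


# Toy data: two agents, six shared sample ids, one private feature each.
features_per_agent = [
    {0: (0,), 1: (1,), 2: (2,), 3: (10,), 4: (11,), 5: (20,)},
    {0: (5,), 1: (6,), 2: (9,), 3: (0,), 4: (1,), 5: (30,)},
]
# Local cluster labels, as produced by any base clustering method.
labels_per_agent = [
    {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1},
    {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2},
]

groups = consensus_groups(labels_per_agent)
medoids = [
    select_medoid(g, [local_ranking(g, f) for f in features_per_agent])
    for g in groups
]
print(groups)   # [[0, 1, 2], [3, 4], [5]]
print(medoids)  # [1, 3, 5]
```

In this toy run the six samples reduce to three medoids, which would form the input of the next round. Repeating the step on the medoids is one way to picture how a shared hierarchy could emerge while only identifiers, labels, and ranks are exchanged.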

Paper Structure (29 sections, 2 theorems, 9 equations, 6 figures, 16 tables, 1 algorithm)

This paper contains 29 sections, 2 theorems, 9 equations, 6 figures, 16 tables, 1 algorithm.

Key Result

Proposition 3.1

At iteration $e$, the communication cost $b^{(e)}$ of VertCoHiRF with $A$ agents, expressed in bits, is bounded by

Figures (6)

  • Figure 1: Empirical robustness of Phase 2 under a single Byzantine attacker. Performance degrades smoothly as the consensus quality decreases with noise $\sigma$.
  • Figure 2: Local views of the synthetic multimodal dataset. Each agent observes a coherent but partial clustering structure in its own feature space.
  • Figure 3: Visualization of the 6-cluster ground-truth partition projected onto each agent’s feature space.
  • Figure 4: Clustering performance as a function of the number of collaborative agents, measured by ARI.
  • Figure 5: ARI boxplots across feature partitions. The blue dashed line shows the non-collaborative local $k$-means reference (mean median of agents across five random feature partitions).
  • ...and 1 more figure

Theorems & Definitions (3)

  • Proposition 3.1: Communication complexity
  • Definition 3.2: Identifier-level structural privacy
  • Proposition 3.3: Privacy by construction