A new type of federated clustering: A non-model-sharing approach
Yuji Kawamata, Kaoru Kamijo, Masateru Kihira, Akihiro Toyoda, Tomoru Nakayama, Akira Imakura, Tetsuya Sakurai, Yukihiko Okada
TL;DR
DC-Clustering presents a privacy-preserving, non-model-sharing federated clustering framework capable of handling complex horizontal-vertical data partitions. It uses an anchor-guided construction of intermediate representations and a collaborative representation to perform either k-means or spectral clustering in a single round of communication with a central analyst. Empirical results on synthetic and open datasets show clustering performance close to centralized pooling, addressing a key gap in federated clustering for mixed data distributions. The approach offers practical value for privacy-sensitive domains by enabling knowledge discovery from distributed heterogeneous data while reducing communication overhead.
Abstract
In recent years, the growing need to leverage sensitive data across institutions has drawn increased attention to federated learning (FL), a decentralized machine learning paradigm that enables model training without sharing raw data. However, existing FL-based clustering methods, known as federated clustering, typically assume simple data partitioning scenarios such as purely horizontal or purely vertical splits, and cannot handle more complex distributed structures. This study proposes data collaboration clustering (DC-Clustering), a novel federated clustering method that supports complex data partitioning scenarios in which horizontal and vertical splits coexist. In DC-Clustering, each institution shares only intermediate representations instead of raw data, which preserves privacy while enabling collaborative clustering. The method allows flexible selection between k-means and spectral clustering, and obtains the final result with a single round of communication with the central server. We conducted extensive experiments using synthetic and open benchmark datasets. The results show that our method achieves clustering performance comparable to that of centralized clustering, in which all data are pooled. DC-Clustering addresses an important gap in current FL research by enabling effective knowledge discovery from distributed heterogeneous data. Its practical properties (privacy preservation, communication efficiency, and flexibility) make it a promising tool for privacy-sensitive domains such as healthcare and finance.
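The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a single-round, non-model-sharing pipeline of this kind could look. It assumes a generic data-collaboration recipe: each institution applies a private random linear map to its own block and to a shared random anchor, and the analyst aligns the uploaded blocks by least squares on the anchor. The toy data, the 2 x 2 partition, the names `local_share`, `kmeans`, and `collab`, and the farthest-point k-means initialization are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 3 well-separated blobs in 4 features.
blob_centers = np.array([[0, 0, 0, 0], [10, 10, 0, 0], [0, 0, 10, 10]], dtype=float)
y_true = rng.permutation(np.repeat(np.arange(3), 30))
X = blob_centers[y_true] + 0.5 * rng.standard_normal((90, 4))

# Mixed partition: 2 row groups x 2 column groups = 4 institutions.
row_groups = [np.arange(0, 45), np.arange(45, 90)]
col_groups = [np.arange(0, 2), np.arange(2, 4)]

# Shared anchor: random rows that contain no real records.
anchor = rng.standard_normal((200, 4))


def local_share(X_block, anchor_block, rng):
    """Institution side: map own data and the anchor with a private matrix F."""
    d = X_block.shape[1]
    F = rng.standard_normal((d, d))  # kept local, never uploaded
    return X_block @ F, anchor_block @ F


# One upload per institution: (intermediate data, intermediate anchor).
X_tilde, A_tilde = [], []
for rows in row_groups:
    shares = [local_share(X[np.ix_(rows, cols)], anchor[:, cols], rng)
              for cols in col_groups]
    X_tilde.append(np.hstack([s[0] for s in shares]))
    A_tilde.append(np.hstack([s[1] for s in shares]))

# Analyst side: choose a common target Z from the stacked intermediate anchors,
# then fit one alignment matrix per row group so that A_tilde_i @ G_i ~= Z.
U, _, _ = np.linalg.svd(np.hstack(A_tilde), full_matrices=False)
Z = U[:, :4]
collab = np.vstack([
    Xt @ np.linalg.lstsq(At, Z, rcond=None)[0]
    for Xt, At in zip(X_tilde, A_tilde)
])


def kmeans(points, k, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    cents = [points[0]]
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in cents], axis=0)
        cents.append(points[np.argmax(dist)])
    cents = np.array(cents)
    for _ in range(n_iter):
        gaps = np.linalg.norm(points[:, None] - cents[None], axis=2)
        labels = np.argmin(gaps, axis=1)
        cents = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels


# Cluster the collaborative representation; raw X never left the institutions.
labels = kmeans(collab, k=3)
```

In this sketch the private maps are square, so the aligned blocks differ from the pooled data only by one common linear transform, which is why k-means on `collab` can match centralized clustering on such easy data. A dimension-reducing map or a spectral clustering step could be substituted, but how the paper handles either is not stated in this section.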
