SemiDFL: A Semi-Supervised Paradigm for Decentralized Federated Learning
Xinyang Liu, Pengchao Han, Xuan Li, Bo Liu
TL;DR
SemiDFL tackles semi-supervised learning in decentralized federated learning, where clients hold differing mixes of labeled and unlabeled data under highly non-IID conditions. It establishes consensus in both model and data spaces by combining neighborhood pseudo-labeling, a consensus-based diffusion model for data synthesis, and adaptive aggregation that weights neighbor models by their performance on synthesized data. The diffusion-generated samples form a shared consensus data space, and the adaptive neighbor weighting forms a consensus model space, which together improve classifier training without sharing raw data. Experiments on MNIST, Fashion-MNIST, and CIFAR-10 show that SemiDFL consistently outperforms existing SSL baselines in DFL settings and closely approaches the centralized upper bound, indicating its practical potential for privacy-preserving, scalable SSL in distributed systems.
Abstract
Decentralized federated learning (DFL) realizes cooperative model training among connected clients without relying on a central server, thereby mitigating communication bottlenecks and eliminating the single point of failure present in centralized federated learning (CFL). Most existing work on DFL focuses on supervised learning, assuming each client possesses sufficient labeled data for local training. However, in real-world applications, much of the data is unlabeled. We address this by considering a challenging yet practical semi-supervised learning (SSL) scenario in DFL, where clients may have varying data sources: some with few labeled samples, some with purely unlabeled data, and others with both. In this work, we propose SemiDFL, the first semi-supervised DFL method that enhances DFL performance in SSL scenarios by establishing a consensus in both data and model spaces. Specifically, we utilize neighborhood information to improve the quality of pseudo-labeling, which is crucial for effectively leveraging unlabeled data. We then design a consensus-based diffusion model to generate synthesized data, which is used in combination with pseudo-labeled data to create mixed datasets. Additionally, we develop an adaptive aggregation method that leverages each model's accuracy on synthesized data to further enhance SemiDFL performance. Through extensive experiments, we demonstrate that the proposed SemiDFL method outperforms existing CFL and DFL schemes in both IID and non-IID SSL scenarios.
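The abstract names neighborhood pseudo-labeling and adaptive aggregation but does not give their exact rules, so the following is only a minimal sketch under stated assumptions. It assumes pseudo-labels come from averaging the class-probability vectors predicted by a client and its neighbors, with a hypothetical confidence threshold, and that aggregation weights are a softmax over each neighbor model's accuracy on synthesized data. The function names, the threshold, and the temperature are illustrative and are not taken from the paper.

```python
import math


def neighborhood_pseudo_label(prob_vectors, threshold=0.9):
    """Average class probabilities from a client and its neighbors.

    Returns (label, confidence). The label is None when the averaged
    confidence falls below `threshold` (a hypothetical cutoff).
    """
    n_classes = len(prob_vectors[0])
    avg = [sum(p[c] for p in prob_vectors) / len(prob_vectors)
           for c in range(n_classes)]
    confidence = max(avg)
    label = avg.index(confidence)
    return (label if confidence >= threshold else None), confidence


def adaptive_weights(accuracies, temperature=1.0):
    """Softmax over per-model accuracies on synthesized data (assumed rule)."""
    top = max(accuracies)  # subtract the max for numerical stability
    exps = [math.exp((a - top) / temperature) for a in accuracies]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(models, weights):
    """Weighted average of models given as flat parameter lists."""
    n_params = len(models[0])
    return [sum(w * m[i] for w, m in zip(weights, models))
            for i in range(n_params)]


# Example: three predictions for one unlabeled sample, two classes.
label, conf = neighborhood_pseudo_label(
    [[0.7, 0.3], [0.95, 0.05], [0.9, 0.1]], threshold=0.8)

# Example: two neighbor models scored on synthesized data.
weights = adaptive_weights([0.9, 0.5])
fused = aggregate([[1.0, 2.0], [3.0, 4.0]], weights)
```

In this sketch a model that scores higher on the synthesized data receives a larger share of the fused parameters, which is one simple way to realize accuracy-driven neighbor weighting. The actual weighting rule in SemiDFL may differ.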
