Diffusion Model-Based Data Synthesis Aided Federated Semi-Supervised Learning

Zhongwei Wang, Tong Wu, Zhiyong Chen, Liang Qian, Yin Xu, Meixia Tao

TL;DR

Diffusion model-based data synthesis aided FSSL (DDSA-FSSL) is a federated semi-supervised learning (FSSL) approach that uses a diffusion model (DM) to generate synthetic data, bridging the gap between heterogeneous local data distributions and the global data distribution.

Abstract

Federated semi-supervised learning (FSSL) faces two main challenges: the scarcity of labeled data across clients and the non-independent and identically distributed (non-IID) nature of data among clients. In this paper, we propose a novel approach, diffusion model-based data synthesis aided FSSL (DDSA-FSSL), which uses a diffusion model (DM) to generate synthetic data, bridging the gap between heterogeneous local data distributions and the global data distribution. In DDSA-FSSL, clients address the scarcity of labeled data by employing a classifier trained via federated learning to pseudo-label their unlabeled data. The DM is then collaboratively trained using both labeled and precision-optimized pseudo-labeled data, enabling clients to generate synthetic samples for classes that are absent from their labeled datasets. This process allows clients to build more comprehensive synthetic datasets aligned with the global distribution. Extensive experiments on multiple datasets and varying non-IID distributions demonstrate the effectiveness of DDSA-FSSL; for example, it improves accuracy from 38.46% to 52.14% on the CIFAR-10 dataset with 10% labeled data.
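
To make the idea of aligning a client's data with the global distribution concrete, the following is a minimal sketch of how a client might decide how many synthetic samples to request per class. The rule used here is an assumption for illustration, not the paper's formula: the function name `synthetic_quota`, the way the target counts are derived, and the interpretation of `alpha` as a simple multiplier on the per-class shortfall are all hypothetical.

```python
from collections import Counter


def synthetic_quota(local_labels, global_dist, alpha=1.0):
    """Return {class: number of synthetic samples to generate}.

    Assumed rule (hypothetical, not taken from the paper):
    1. Find the smallest dataset that follows `global_dist` and still
       contains every local sample.
    2. The per-class shortfall against that target is the deficit.
    3. Generate `alpha` times the deficit for each class.
    """
    counts = Counter(local_labels)
    # Only classes with non-zero global probability can be targeted.
    classes = [c for c, p in global_dist.items() if p > 0]
    # Scale factor set by the most over-represented local class.
    scale = max(counts[c] / global_dist[c] for c in classes)
    quota = {}
    for c in classes:
        target = scale * global_dist[c]
        deficit = max(target - counts[c], 0.0)
        quota[c] = int(round(alpha * deficit))
    return quota


# A client holding only classes 0 and 1, while class 2 exists globally.
local = [0] * 8 + [1] * 2
global_dist = {0: 0.5, 1: 0.25, 2: 0.25}
print(synthetic_quota(local, global_dist))             # {0: 0, 1: 2, 2: 4}
print(synthetic_quota(local, global_dist, alpha=0.5))  # {0: 0, 1: 1, 2: 2}
```

Under this assumed rule, `alpha=1.0` makes the combined local and synthetic data match the global class proportions exactly, and a class missing locally (class 2 above) receives the largest quota.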

Paper Structure

This paper contains 11 sections, 15 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the proposed DDSA-FSSL. In the first step, each client performs federated training of a global classifier using labeled data. In the second step, the global classifier performs pseudo-labeling for the unlabeled data at each client, followed by a precision-driven optimization process guided by the global confusion matrix $\mathcal{M}_g^t$ to refine and select high-quality pseudo-labeled samples. In the third step, clients collaboratively train the DMs using both the labeled and optimized pseudo-labeled data. In the fourth step, the DMs are employed by clients to generate specific synthetic data, based on discrepancies between local and global data distributions. Finally, clients conduct federated training of the classifier using both labeled and synthetic data.
  • Figure 2: Impact of the labeled-data ratio on performance with augmentation strength $\alpha=1$.
  • Figure 3: Precision and recall variations across classes.
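
The second step in the Figure 1 caption, selecting high-quality pseudo-labeled samples with the help of a confusion matrix, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's precision-driven optimization: the helper names, the fixed `min_precision` threshold, and the convention that rows are true classes and columns are predicted classes are all hypothetical.

```python
def per_class_precision(confusion):
    """Precision of each predicted class.

    Assumes confusion[r][c] counts samples of true class r predicted as c.
    """
    n = len(confusion)
    precisions = []
    for c in range(n):
        predicted_as_c = sum(confusion[r][c] for r in range(n))
        correct = confusion[c][c]
        precisions.append(correct / predicted_as_c if predicted_as_c else 0.0)
    return precisions


def select_pseudo_labels(predictions, confusion, min_precision=0.8):
    """Keep (sample_index, pseudo_label) pairs whose predicted class
    reaches the precision threshold (a hypothetical selection rule)."""
    precisions = per_class_precision(confusion)
    return [
        (idx, label)
        for idx, label in enumerate(predictions)
        if precisions[label] >= min_precision
    ]


# Toy 3-class confusion matrix standing in for the global matrix.
confusion = [
    [9, 1, 0],
    [2, 6, 2],
    [0, 3, 7],
]
predictions = [0, 1, 2, 1, 0]  # pseudo-labels for five unlabeled samples
print(select_pseudo_labels(predictions, confusion, min_precision=0.75))
# [(0, 0), (2, 2), (4, 0)]
```

In this toy example class 1 has a precision of 0.6, so samples pseudo-labeled as class 1 are dropped, while classes 0 and 2 (roughly 0.82 and 0.78) pass the threshold.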