CATCHFed: Efficient Unlabeled Data Utilization for Semi-Supervised Federated Learning in Limited Labels Environments

Byoungjun Park, Pedro Porto Buarque de Gusmão, Dongjin Ji, Minhoe Kim

TL;DR

CATCHFed tackles the challenge of extremely label-scarce semi-supervised federated learning by introducing three mechanisms: client-aware adaptive warm-up thresholds (CAWT) that adjust per-class thresholds for each client, a hybrid energy-based thresholding scheme to improve pseudo-label quality, and consistency regularization that leverages unpseudo-labeled data. The approach maximizes unlabeled data usage by enabling pseudo-labeling only when both confidence and distribution-alignment criteria are met, while still benefiting from discarded samples through consistency losses. Empirical results across CIFAR-10/100 and SVHN under IID and Non-IID settings show CATCHFed consistently outperforms strong baselines, often by notable margins, especially when server labels are extremely limited. The work also provides insights into energy-threshold tuning and calibration, highlighting practical implications for deploying SSFL in real-world, privacy-preserving settings.

Abstract

Federated learning (FL) is a promising paradigm that utilizes distributed client resources while preserving data privacy. Most existing FL approaches assume that clients possess labeled data; in real-world scenarios, however, client-side labels are often unavailable. Semi-supervised federated learning (SSFL), where only the server holds labeled data, addresses this issue, but its performance degrades significantly as the amount of labeled data decreases. To tackle this problem, we propose CATCHFed, which introduces client-aware adaptive thresholds that account for class difficulty, uses hybrid thresholds to enhance pseudo-label quality, and applies consistency regularization to unpseudo-labeled data. Extensive experiments across various datasets and configurations demonstrate that CATCHFed effectively leverages unlabeled client data, achieving superior performance even in extremely limited-label settings.
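To make the hybrid-threshold idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the energy score is the standard free energy of the logits, E(x) = -T log Σ exp(z_k / T), and that a sample is pseudo-labeled only when its softmax confidence is high and its energy is low. The function names, the fixed thresholds, and the temperature default are hypothetical; the per-client, per-class adaptive thresholds described in the paper are not modeled here.

```python
import math


def softmax_confidence(logits):
    """Return (predicted class index, max softmax probability)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    conf = max(probs)
    return probs.index(conf), conf


def free_energy(logits, temperature=1.0):
    """Free-energy score -T * logsumexp(z / T); lower means more in-distribution.

    Assumption: this standard definition stands in for the paper's energy score.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp


def hybrid_select(logits, conf_threshold, energy_threshold):
    """Accept a pseudo-label only if BOTH criteria pass (hypothetical rule)."""
    label, conf = softmax_confidence(logits)
    energy = free_energy(logits)
    accepted = conf >= conf_threshold and energy <= energy_threshold
    return (label if accepted else None), conf, energy


def partition_batch(batch_logits, conf_threshold, energy_threshold):
    """Split a batch into pseudo-labeled pairs and unpseudo-labeled indices.

    The rejected indices are the samples that would still contribute
    through a consistency-regularization loss.
    """
    pseudo_labeled, unpseudo_labeled = [], []
    for i, logits in enumerate(batch_logits):
        label, _, _ = hybrid_select(logits, conf_threshold, energy_threshold)
        if label is None:
            unpseudo_labeled.append(i)
        else:
            pseudo_labeled.append((i, label))
    return pseudo_labeled, unpseudo_labeled
```

In this sketch, a sample with logits `[2.0, -2.0, -2.0]` passes a confidence threshold of 0.95 but fails an energy threshold of -5.0, because its logits are small in magnitude. It therefore lands in the unpseudo-labeled set instead of receiving a pseudo-label, which is the case where the hybrid rule differs from confidence-only selection.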

Paper Structure

This paper contains 24 sections, 18 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: CIFAR-10 accuracy comparison by labeled sample count (FlexMatch [sohn2020fixmatch], $(FL)^2$ [lee20242], SemiFL [diao2022semifl], FedMatch [jeong2020federated], and FedCon [long2021fedcon]).
  • Figure 2: Comparison of utilization ratio and pseudo-label accuracy between FlexMatch [zhang2021flexmatch] and SemiFL [diao2022semifl] on CIFAR-10 with 20 labeled samples.
  • Figure 3: An overall pipeline of CATCHFed. Blue-bordered components are server-side processes, while gray and black-bordered components are client-side processes.
  • Figure 4: Visualization of pseudo-label selection regions by (a) confidence-based, (b) energy-based, and (c) hybrid thresholding. Shaded regions indicate unlabeled samples selected for pseudo-labeling.
  • Figure 5: Impact of hybrid thresholding on pseudo-label quality (CIFAR-10, IID, 40 labels).
  • ...and 4 more figures