CGMatch: A Different Perspective of Semi-supervised Learning

Bo Cheng, Jueqing Lu, Yuan Tian, Haifeng Zhao, Yi Chang, Lan Du

TL;DR

CGMatch tackles semi-supervised learning under very limited labeling by introducing Count-Gap (CG), a data-map based metric that captures label-prediction dynamics beyond confidence. It integrates Fine-grained Dynamic Selection (FDS) to partition unlabeled samples into easy-to-learn, ambiguous, and hard-to-learn subsets, applying tailored regularization (CE for easy, generalized cross-entropy for ambiguous) to mitigate noisy pseudo-labels. The method demonstrates strong performance on common SSL benchmarks when labeled data are scarce, with efficient early training via higher unlabeled data utilization and better calibration, albeit with some limitations on open-set scenarios like STL10. The approach is adaptable to existing SSL frameworks and requires minimal extra hyperparameters, offering practical gains for real-world SSL deployment.
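
The contrast between the two losses named above can be illustrated with a minimal sketch. It assumes the standard generalized cross-entropy form, (1 - p^q) / q, where p is the probability assigned to the pseudo-label. The function names and the value q = 0.7 are illustrative assumptions, not taken from the CGMatch code.

```python
import math


def cross_entropy(probs, label):
    """Standard cross-entropy for one sample: -log p_label (unbounded as p_label -> 0)."""
    return -math.log(probs[label])


def generalized_cross_entropy(probs, label, q=0.7):
    """Generalized cross-entropy for one sample: (1 - p_label**q) / q.

    q is an illustrative value (assumption). The loss is bounded above by 1/q,
    so a confidently wrong pseudo-label contributes a limited penalty.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return (1.0 - probs[label] ** q) / q


if __name__ == "__main__":
    # A prediction that disagrees strongly with a (possibly noisy) pseudo-label.
    probs = [0.01, 0.99]
    print("CE :", cross_entropy(probs, 0))
    print("GCE:", generalized_cross_entropy(probs, 0))
```

With q = 1 the loss reduces to 1 - p (a mean-absolute-error-like penalty), and as q approaches 0 it approaches the ordinary cross-entropy. That is one plausible reading of why a bounded loss suits the ambiguous subset, where pseudo-labels are less reliable.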

Abstract

Semi-supervised learning (SSL) has garnered significant attention due to its ability to leverage limited labeled data and a large amount of unlabeled data to improve model generalization performance. Recent approaches achieve impressive successes by combining ideas from both consistency regularization and pseudo-labeling. However, these methods tend to underperform in more realistic situations with relatively scarce labeled data. We argue that this issue arises because existing methods rely solely on the model's confidence, making it challenging to accurately assess the model's state and identify unlabeled examples that contribute to training when supervision information is limited, especially during the early stages of model training. In this paper, we propose a novel SSL model called CGMatch, which, for the first time, incorporates a new metric known as Count-Gap (CG). We demonstrate that CG is effective in discovering unlabeled examples beneficial for model training. Combining CG with confidence, a commonly used metric in SSL, we propose a fine-grained dynamic selection (FDS) strategy. This strategy dynamically divides the unlabeled dataset into three subsets with different characteristics: an easy-to-learn set, an ambiguous set, and a hard-to-learn set. By selectively filtering these subsets and applying the corresponding regularization to the selected ones, we mitigate the negative impact of incorrect pseudo-labels on model optimization and generalization. Extensive experimental results on several common SSL benchmarks indicate the effectiveness of CGMatch, especially when the labeled data are particularly limited. Source code is available at https://github.com/BoCheng-96/CGMatch.

Paper Structure

This paper contains 16 sections, 9 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Paradigm of X-Match. Given an unlabeled sample, the weakly-augmented image is fed into the model, and the sample is assigned to a corresponding class (i.e., pseudo-label) when its highest prediction probability exceeds the confidence threshold. This threshold is determined by various strategies proposed in X-Match. Consistency regularization is then applied to ensure that the prediction from the strongly-augmented version remains consistent with the pseudo-label.
  • Figure 2: Data map for the CIFAR10 dataset with 40 labels regarding Count-Gap. The x-axis represents variability, and the y-axis reflects confidence. Colors indicate the Count-Gap (CG). In this data map, the top-left corner (low variability, high confidence) highlights easy-to-learn examples, the bottom-left corner (low variability, low confidence) identifies hard-to-learn examples, while examples on the right (high variability) are categorized as ambiguous. It is evident that Count-Gap is effective in distinguishing these three types of subsets within the context of SSL.
  • Figure 3: The framework of the proposed CGMatch. First, one network takes the unlabeled samples with diverse augmentations as input and outputs the corresponding prediction distributions, which are necessary for consistency regularization. Then, a fine-grained dynamic selection (FDS) strategy takes the class predictions and Count-Gap distributions of the weakly-augmented versions into account to divide the unlabeled data into three subsets: easy-to-learn set, ambiguous set, and hard-to-learn set. Finally, different regularization techniques are employed to incorporate easy-to-learn samples and ambiguous samples into model training, aiding both model optimization and generalization.
  • Figure 4: Accuracy in the early stages
  • Figure 5: ECE in the early stages
  • ...and 6 more figures
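
The X-Match paradigm described in the Figure 1 caption can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it uses a single fixed threshold (0.95 is an assumed value), whereas the caption notes that X-Match methods determine the threshold by various strategies. The function names are hypothetical.

```python
import math


def pseudo_label(weak_probs, threshold=0.95):
    """Return the argmax class if its probability exceeds the threshold, else None.

    threshold=0.95 is an assumed, fixed value used only for illustration.
    """
    confidence = max(weak_probs)
    if confidence <= threshold:
        return None
    return weak_probs.index(confidence)


def consistency_loss(weak_batch, strong_batch, threshold=0.95):
    """Cross-entropy between strong-view predictions and pseudo-labels from weak views.

    Samples whose weak-view confidence does not exceed the threshold are masked out
    (they add zero), and the sum is averaged over the whole batch.
    """
    total = 0.0
    for weak_probs, strong_probs in zip(weak_batch, strong_batch):
        label = pseudo_label(weak_probs, threshold)
        if label is not None:
            total += -math.log(strong_probs[label])
    return total / len(weak_batch)


if __name__ == "__main__":
    weak = [[0.97, 0.02, 0.01], [0.50, 0.30, 0.20]]    # predictions on weak views
    strong = [[0.80, 0.10, 0.10], [0.10, 0.10, 0.80]]  # predictions on strong views
    print(pseudo_label(weak[0]), pseudo_label(weak[1]))
    print(consistency_loss(weak, strong))
```

In this sketch the second sample is discarded entirely because its confidence is too low. CGMatch's stated motivation is that such a confidence-only filter gives a coarse picture of the model's state when labels are scarce.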