SemiReward: A General Reward Model for Semi-supervised Learning

Siyuan Li, Weiyang Jin, Zedong Wang, Fang Wu, Zicheng Liu, Cheng Tan, Stan Z. Li

TL;DR

This work tackles the core SSL problem of unreliable pseudo-labels and confirmation bias by introducing SemiReward, a general rewarder that outputs a calibrated score $r\in[0,1]$ to filter pseudo labels. The rewarder is trained online in a two-stage workflow with a lightweight generator to decouple student training from reward estimation, using a cosine-based label similarity $\mathcal{S}(y^u,y^l)$ as the target for $\mathcal{R}$. Empirically, SemiReward yields substantial accuracy gains and faster convergence across 13 SSL benchmarks spanning CV, NLP, and Audio, and it remains compatible with diverse SSL methods like Pseudo Label, FlexMatch, and Free/SoftMatch. This approach offers a practical, modular enhancement for SSL that improves label quality without large overhead, broadening applicability across tasks and modalities.

Abstract

Semi-supervised learning (SSL) has witnessed great progress with various improvements in the self-training framework with pseudo labeling. The main challenge is how to identify high-quality pseudo labels in the presence of confirmation bias. However, existing pseudo-label selection strategies are limited to pre-defined schemes or complex hand-crafted policies specially designed for classification, failing to achieve high-quality labels, fast convergence, and task versatility simultaneously. To this end, we propose a Semi-supervised Reward framework (SemiReward) that predicts reward scores to evaluate and select high-quality pseudo labels, and that can be plugged into mainstream SSL methods across a wide range of task types and scenarios. To mitigate confirmation bias, SemiReward is trained online in two stages with a generator model and a subsampling strategy. With classification and regression tasks on 13 standard SSL benchmarks across three modalities, extensive experiments verify that SemiReward achieves significant performance gains and faster convergence speeds over Pseudo Label, FlexMatch, and Free/SoftMatch. Code and models are available at https://github.com/Westlake-AI/SemiReward.
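
The reward-based selection summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the rescaling of cosine similarity from $[-1,1]$ to $[0,1]$, the fixed threshold, and the function names `label_similarity` and `filter_pseudo_labels` are all assumptions made for illustration. In the actual framework the score $r$ would come from the trained rewarder $\mathcal{R}$, which has no access to true labels at selection time; the similarity below only plays the role of a training target.

```python
import math


def label_similarity(y_u, y_l):
    """Cosine similarity between a pseudo label and a reference label,
    rescaled from [-1, 1] to [0, 1] (the rescaling is an assumption)."""
    dot = sum(a * b for a, b in zip(y_u, y_l))
    norm_u = math.sqrt(sum(a * a for a in y_u))
    norm_l = math.sqrt(sum(b * b for b in y_l))
    if norm_u == 0.0 or norm_l == 0.0:
        # Cosine is undefined for a zero vector; return a neutral score (assumption).
        return 0.5
    cos = dot / (norm_u * norm_l)
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point drift
    return 0.5 * (cos + 1.0)


def filter_pseudo_labels(pseudo_labels, rewards, threshold):
    """Keep only the pseudo labels whose reward score reaches the threshold."""
    return [y for y, r in zip(pseudo_labels, rewards) if r >= threshold]


if __name__ == "__main__":
    # Toy one-hot labels for a 3-class problem (hypothetical data).
    reference = [1.0, 0.0, 0.0]
    candidates = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
    scores = [label_similarity(c, reference) for c in candidates]
    kept = filter_pseudo_labels(candidates, scores, threshold=0.8)
    print(scores, kept)
```

Because the score depends only on the label vectors, the same function applies to one-hot classification labels and to real-valued regression targets, which is one plausible reading of why a label-similarity target suits both task types.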

Paper Structure (43 sections, 7 equations, 11 figures, 19 tables, 1 algorithm)

Figures (11)

  • Figure 1: SemiReward (abbreviated as SR) enables existing SSL methods to select high-quality pseudo labels on both classification and regression tasks with fast convergence speeds (see Figure 2). Error rates of SSL algorithms are plotted on CV, NLP, and Audio datasets. Note that "previous SOTA" marks the best performance among a set of methods: 4 general SSL methods used for classification and regression tasks in (a), and 17 SSL methods in USB [nips2022usb] in (b). SemiReward noticeably improves performance when plugged into existing SSL methods.
  • Figure 2: Top-1 accuracy vs. training iterations ($\times$2048) on SSL datasets (with the number of labels used) of three modalities. Employing SemiReward with SOTA SSL methods produces +1.9$\sim$3.7 performance gains with at least 1.7 times fewer training iterations compared to the baseline. Early stopping is applied when the validation performance reaches its peak.
  • Figure 3: Illustration of the SSL training paradigm, where blue lines denote the pseudo-labeling pipeline and red lines denote gradient propagation. (a) Confidence-based label selection strategies and strong-weak augmentations for consistency are task-specific and modality-specific (requiring ad-hoc augmentations). (b) Rewarder $\mathcal{R}$ is a plug-and-play label selection module for general SSL scenarios.
  • Figure 4: How the rewarder works, illustrated by reward scores vs. top-1 accuracy on CIFAR-100 (400 labels). (a) Analysis of alternative reward similarities; (b) Ablation of the cross-attention module in $\mathcal{R}$, which is the vital component for learning calibrated reward scores; (c) Ablation of MLP layers.
  • Figure 5: Credible reward scores ensure stable optimization of the student model, while raw pseudo labels in general SSL methods gravely mislead the student on the regression task on RCF-MNIST (1% labels).
  • ...and 6 more figures