Revisiting semi-supervised learning in the era of foundation models

Ping Zhang, Zheda Mai, Quang-Huy Nguyen, Wei-Lun Chao

TL;DR

This work examines how semi-supervised learning (SSL) interacts with vision foundation models (VFMs). It builds a VTAB-based SSL benchmark from datasets on which frozen VFMs underperform, and finds that carefully tuned parameter-efficient fine-tuning (PEFT) on labeled data alone often matches SSL performance, even though SSL also has access to abundant unlabeled data. Building on this, the authors propose V-PET, a simple self-training baseline that ensembles pseudo-labels from multiple PEFT methods and VFM backbones and improves over traditional SSL methods. The results point to a practical, scalable SSL pathway for the foundation-model era and argue for SSL approaches designed specifically for VFMs rather than for training from scratch.

Abstract

Semi-supervised learning (SSL) leverages abundant unlabeled data alongside limited labeled data to enhance learning. As vision foundation models (VFMs) increasingly serve as the backbone of vision applications, it remains unclear how SSL interacts with these pre-trained models. To address this gap, we develop new SSL benchmark datasets where frozen VFMs underperform and systematically evaluate representative SSL methods. We make a surprising observation: parameter-efficient fine-tuning (PEFT) using only labeled data often matches SSL performance, even without leveraging unlabeled data. This motivates us to revisit self-training, a conceptually simple SSL baseline, where we use the supervised PEFT model to pseudo-label unlabeled data for further training. To overcome the notorious issue of noisy pseudo-labels, we propose ensembling multiple PEFT approaches and VFM backbones to produce more robust pseudo-labels. Empirical results validate the effectiveness of this simple yet powerful approach, providing actionable insights into SSL with VFMs and paving the way for more scalable and practical semi-supervised learning in the era of foundation models.
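
The pseudo-label ensembling step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the aggregation rule, so averaging one-hot votes across models is an assumption made here for illustration, and the names `argmax`, `ensemble_pseudo_labels`, and the toy probabilities are hypothetical rather than taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def argmax(probs: Sequence[float]) -> int:
    """Index of the largest probability (the first index wins on ties)."""
    return max(range(len(probs)), key=lambda k: probs[k])


def ensemble_pseudo_labels(
    per_model_probs: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Average one-hot votes across models (an assumed aggregation rule).

    per_model_probs[m][n][c] is model m's predicted probability of class c
    for unlabeled sample n. Returns one soft pseudo-label per sample.
    """
    num_models = len(per_model_probs)
    num_samples = len(per_model_probs[0])
    num_classes = len(per_model_probs[0][0])
    soft_labels = []
    for n in range(num_samples):
        # Each model casts one vote for its most likely class.
        votes = Counter(argmax(per_model_probs[m][n]) for m in range(num_models))
        soft_labels.append([votes[c] / num_models for c in range(num_classes)])
    return soft_labels


# Toy predictions from three hypothetical fine-tuned models
# (2 unlabeled samples, 3 classes).
probs_a = [[0.7, 0.2, 0.1], [0.1, 0.5, 0.4]]
probs_b = [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]
probs_c = [[0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]

soft_labels = ensemble_pseudo_labels([probs_a, probs_b, probs_c])
hard_labels = [argmax(p) for p in soft_labels]
print(soft_labels, hard_labels)
```

In this toy example, two of the three models vote for class 0 on the first sample, so its soft label places 2/3 of the mass on class 0. Voting with hard labels ignores each model's confidence scale, which is one way to sidestep differences in calibration between models; averaging the raw probabilities would be an equally plausible alternative.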

Paper Structure

This paper contains 28 sections, 14 equations, 7 figures, 13 tables, 1 algorithm.

Figures (7)

  • Figure 1: Left: Venn diagram of the top 20% highest-confidence predictions from various VFMs and PEFT methods under DTD 3-shot, illustrating that they produce diverse sets of high-confidence predictions. Right: Ensembling these diverse pseudo-label predictions progressively boosts downstream performance as more models are ensembled (Self-Training → PET → V-PET), highlighting the quality improvements from diversity. Results are averaged across 12 settings in our benchmark.
  • Figure 2: Illustration of V-PET. To effectively leverage abundant unlabeled data alongside scarce labeled data in the era of VFMs, our approach follows four phases: (a) Supervised Parameter-Efficient Fine-Tuning, where we harness labeled data by fine-tuning pre-trained VFMs using various PEFT algorithms; (b) Pseudo-Label Generation, where we exploit fine-tuned VFMs' generalization ability to generate pseudo-labels for unlabeled data; (c) Pseudo-Label Ensemble, where we enhance robustness by aggregating pseudo-labels from multiple fine-tuned VFMs; and (d) Self-Training, where we consolidate all knowledge into one model.
  • Figure 3: Average accuracy of SSL methods with full fine-tuning or PEFT across 12 settings. With fair hyper-parameter tuning, fine-tuning on the limited labels alone can outperform SSL; PEFT boosts SSL yet only matches labeled-only performance, indicating that unlabeled data brings minimal benefit in current VFM-based SSL (see the section on PEFT details).
  • Figure 4: The average entropy of predicted probability distributions from different fine-tuned VFMs. The entropy gap among their pseudo-labels indicates poor calibration.
  • Figure 5: Ranking frequency of SSL methods on the proposed benchmark. The entry at $(i, j)$ is the number of times method $i$ is ranked $j$-th across the 12 settings. The number in brackets is the average rank, where a higher rank is better.
  • ...and 2 more figures
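
The entropy statistic mentioned in the Figure 4 caption can be made concrete with a short sketch. It computes the mean Shannon entropy (in nats) of a model's predicted distributions; the function names and the toy predictions are hypothetical, and the caption does not say which logarithm base or averaging scheme the paper uses.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(batch: Sequence[Sequence[float]]) -> float:
    """Average entropy over a batch of predicted distributions."""
    return sum(shannon_entropy(p) for p in batch) / len(batch)


# Toy predictions from two hypothetical models on the same two samples.
confident_model = [[0.97, 0.02, 0.01], [0.90, 0.05, 0.05]]
cautious_model = [[0.50, 0.30, 0.20], [0.40, 0.35, 0.25]]

print(mean_entropy(confident_model), mean_entropy(cautious_model))
```

A large difference between the two averages means the models' confidence scores are not directly comparable, which is the kind of gap the caption describes.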