LayerMatch: Do Pseudo-labels Benefit All Layers?

Chaoqi Liang; Guanglei Yang; Lifeng Qiao; Zitong Huang; Hongliang Yan; Yunchao Wei; Wangmeng Zuo

TL;DR

LayerMatch challenges the assumption that pseudo-labels uniformly benefit all layers in SSL by revealing distinct learning dynamics between the feature extractor and the linear classifier. It introduces Grad-ReLU to block unsupervised-gradient influence on the classifier while preserving it for the feature extractor, and Avg-Clustering to EMA-smooth feature clustering centers, yielding a cohesive LayerMatch objective. Empirically, LayerMatch delivers consistent improvements across CIFAR-10/100, STL-10, and ImageNet-100, achieving an average gain of $2.44\%$ over SOTA and $10.38\%$ over FixMatch. This layer-aware approach highlights the importance of tailoring pseudo-label usage to the learning role of each network component, with practical impact on improving SSL performance under limited labeled data.
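The EMA smoothing of clustering centers mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ema_update_centers`, the `momentum` value, and the use of per-class batch means as the quantity being smoothed are all assumptions made for illustration.

```python
def ema_update_centers(centers, features, pseudo_labels, momentum=0.9):
    """Return class centers after one exponential-moving-average step.

    centers:       list of per-class center vectors (hypothetical layout)
    features:      list of feature vectors for the current batch
    pseudo_labels: list of class indices, one per feature vector
    momentum:      weight kept on the old center (assumed value)
    """
    sums, counts = {}, {}
    for feat, label in zip(features, pseudo_labels):
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1

    updated = [list(c) for c in centers]  # classes absent from the batch stay unchanged
    for label, acc in sums.items():
        batch_mean = [value / counts[label] for value in acc]
        updated[label] = [
            momentum * old + (1.0 - momentum) * new
            for old, new in zip(centers[label], batch_mean)
        ]
    return updated


# Toy batch: two features pseudo-labeled as class 0, one as class 1.
centers = [[0.0, 0.0], [0.0, 0.0]]
features = [[1.0, 1.0], [3.0, 3.0], [-2.0, 0.0]]
pseudo_labels = [0, 0, 1]
new_centers = ema_update_centers(centers, features, pseudo_labels, momentum=0.9)
```

With a momentum close to 1, a single noisy batch moves each center only slightly, which is the stabilizing effect an EMA is typically chosen for.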

Abstract

Deep neural networks have achieved remarkable performance across various tasks when supplied with large-scale labeled data. However, collecting labeled data can be time-consuming and labor-intensive. Semi-supervised learning (SSL), particularly through pseudo-labeling algorithms that iteratively assign pseudo-labels for self-training, offers a promising way to reduce the dependency on labeled data. Previous research generally applies a uniform pseudo-labeling strategy across all model layers, assuming that pseudo-labels influence every layer in the same way. In contrast, our theoretical analysis and empirical experiments demonstrate that the feature extraction layer and the linear classification layer exhibit distinct learning behaviors in response to pseudo-labels. Based on these insights, we develop two layer-specific pseudo-label strategies, termed Grad-ReLU and Avg-Clustering. Grad-ReLU mitigates the impact of noisy pseudo-labels by removing the detrimental gradient effects of pseudo-labels in the linear classification layer. Avg-Clustering accelerates the convergence of the feature extraction layer towards stable clustering centers by integrating consistent outputs. Our approach, LayerMatch, which integrates these two strategies, avoids the severe interference of noisy pseudo-labels in the linear classification layer while accelerating the clustering capability of the feature extraction layer. In extensive experiments, our approach consistently performs strongly on standard semi-supervised learning benchmarks, achieving an improvement of 10.38% over the baseline method and 2.44% over state-of-the-art methods.
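As a rough illustration of treating the two layers differently, the sketch below computes cross-entropy gradients by hand for a linear classifier and keeps only the supervised gradient for the classifier parameters, while the pseudo-labeled sample still produces a gradient with respect to its feature vector. The exact Grad-ReLU rule is not given in this summary, so this is only one reading of "removing the gradient effects of pseudo-labels in the linear classification layer"; all function names and the toy values are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(W, b, h):
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def ce_grads(W, b, h, target):
    """Cross-entropy gradients w.r.t. W, b, and the feature h for one sample."""
    p = softmax(linear(W, b, h))
    dz = [p_k - (1.0 if k == target else 0.0) for k, p_k in enumerate(p)]
    dW = [[dz_k * h_j for h_j in h] for dz_k in dz]
    dh = [sum(W[k][j] * dz[k] for k in range(len(W))) for j in range(len(h))]
    return dW, dz, dh  # the bias gradient equals dz


def layerwise_grads(W, b, h_labeled, y, h_unlabeled, pseudo_y):
    dW_sup, db_sup, dh_sup = ce_grads(W, b, h_labeled, y)
    # Assumption: the pseudo-label loss contributes nothing to the classifier,
    # so its W and b gradients are discarded; only the feature gradient is kept.
    _, _, dh_unsup = ce_grads(W, b, h_unlabeled, pseudo_y)
    return {"W": dW_sup, "b": db_sup,
            "h_labeled": dh_sup, "h_unlabeled": dh_unsup}


W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
grads = layerwise_grads(W, b, h_labeled=[1.0, 0.0], y=0,
                        h_unlabeled=[0.0, 2.0], pseudo_y=0)
```

In an autograd framework the same effect is usually obtained with a stop-gradient on the classifier parameters when computing the unsupervised loss; the manual version here only makes the gradient paths explicit.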

Paper Structure (20 sections, 4 theorems, 21 equations, 3 figures, 7 tables)

Introduction
Related Work
Semi-supervised Learning Methods
Data Selection
Preliminaries
Method
Motivation
LayerMatch
Grad-ReLU
Avg-Clustering
Experiments
Implementation Details
Results
Ablation Study
Conclusion
...and 5 more sections

Key Result

Lemma 4.1

Equation (LU_se) leads to a simplified integral expression for the consistency regularization loss function: where $\mathcal{D}$ represents the continuous input data space spanned by all unlabeled data under infinite data augmentation, and $\nabla_\mathbf{x}$ represents the gradient operator with respect to the input $\mathbf{x}$.

Figures (3)

Figure 1: Training curves on CIFAR-10 with a classic semi-supervised learning (SSL) setup. In this setup, the CIFAR-10 dataset includes 50,000 training samples, of which only 40 are labeled and 49,960 are unlabeled. Models are evaluated on a 10,000-sample test set.
Figure 2: An illustrative example of the data features $\mathcal{M}$ generated by the feature extraction layer $\Theta$, showing how noisy pseudo-labels can harm the linear classification layer. The triangle ($\triangle$) and circle ($\bigcirc$) represent different classes. (a) The optimal linear classification layer, i.e., the ideal classifier constructed from the labels of all data. (b) Learning from labeled data only, where the model relies primarily on labeled data to construct the linear classification layer and selects pseudo-labels based on the threshold $\tau$. (c) Learning from both labels and pseudo-labels, where the model also trains on pseudo-labels generated from unlabeled data; because these pseudo-labels may be noisy, the linear classification layer shifts and fails to accurately distinguish between the classes.
Figure 3: Ablation study of Grad-ReLU on CIFAR-10 (10) with pre-trained ViT.

Theorems & Definitions (6)

Lemma 4.1
Theorem 4.2
Lemma 4.1
proof
Theorem 4.2
proof
