Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning

Yaxin Hou, Bo Han, Yuheng Jia, Hui Liu, Junhui Hou

TL;DR

This work tackles realistic long-tailed semi-supervised learning where unlabeled data distributions are unknown. It introduces Controllable Pseudo-label Generation (CPG), a self-reinforcing cycle that expands the labeled set with reliably pseudo-labeled samples and trains on a distribution that is known, thereby decoupling from the unlabeled data's distribution. Key components include dynamic controllable filtering to select pseudo-labels, logit-adjusted Bayes-optimal classification, class-aware augmentation for minority classes, and an auxiliary branch for full data utilization. The authors provide theoretical guarantees on generalization error and demonstrate state-of-the-art performance across LTSSL benchmarks under diverse distribution scenarios, with substantial gains in challenging settings.

Abstract

Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97\%}$ in accuracy. The code is available at https://github.com/yaxinhou/CPG.
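To make steps (i) to (iii) of the cycle more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. Two parts are assumptions: the filter uses a fixed confidence threshold as a stand-in for the paper's dynamic controllable filtering (whose rule is not given in this summary), and the logit adjustment uses the common form that adds the log class prior to the logits inside the cross-entropy. All function names and the `threshold` parameter are hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def class_prior(counts):
    # Known class distribution of the (expanded) labeled set.
    total = sum(counts)
    return [c / total for c in counts]

def logit_adjusted_loss(logits, label, prior):
    # Assumed form of logit adjustment: cross-entropy on logits shifted
    # by the log prior of the current labeled set.
    adjusted = [z + math.log(p) for z, p in zip(logits, prior)]
    return -log_softmax(adjusted)[label]

def select_pseudo_labels(batch_logits, threshold=0.95):
    # Stand-in filter (assumption): keep samples whose top softmax
    # probability reaches a fixed threshold. Returns (index, class) pairs.
    selected = []
    for idx, logits in enumerate(batch_logits):
        probs = [math.exp(v) for v in log_softmax(logits)]
        conf = max(probs)
        if conf >= threshold:
            selected.append((idx, probs.index(conf)))
    return selected

def cycle_step(counts, batch_logits, threshold=0.95):
    # One pass of the cycle: filter unlabeled samples, add the accepted
    # ones to the labeled set, and recompute the known prior that the
    # next logit-adjusted training step would use.
    picked = select_pseudo_labels(batch_logits, threshold)
    new_counts = list(counts)
    for _, cls in picked:
        new_counts[cls] += 1
    return picked, new_counts, class_prior(new_counts)
```

Because the prior is recomputed from the labeled set's own counts after each step, the adjustment never needs an estimate of the unlabeled data distribution, which is the decoupling the abstract describes.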

Paper Structure

This paper contains 31 sections, 3 theorems, 24 equations, 7 figures, 16 tables, 1 algorithm.

Key Result

Theorem 1

Suppose that the loss function $\ell_{la}(f(x), y)$ is $\rho$-Lipschitz with respect to $f(x)$ for all $y \in \{1, \dots, C\}$ and upper-bounded by $U$. Given the pseudo-labeling error rate $0 < \epsilon < 1$, for any $\upsilon > 0$, with probability at least $1 - \upsilon$, we have: where $T$ denotes the total number of training steps, $R_T$ and $R_0$ represent the model empirical risk at the fi

Figures (7)

  • Figure 1: Comparison of pseudo-label predictions among FreeMatch, SimPro, and our CPG under arbitrary unlabeled data distribution. GT denotes the ground-truth unlabeled data distribution. TP (FP) denotes the predicted true (false) positive pseudo-labels. The dataset is CIFAR-10-LT with $(N_{max}, M_{max}, \gamma_l, \gamma_u) = (400, 4600, 50, 50)$, where $N_{max}$ ($M_{max}$) denotes the number of samples in the most frequent class of the labeled (unlabeled) dataset, while $\gamma_l$ ($\gamma_u$) denotes the imbalance ratio of the labeled (unlabeled) dataset. Our CPG can generate more reliable pseudo-labels than FreeMatch and SimPro for both minority classes (e.g., classes 1 and 2) and majority classes (e.g., classes 8 and 9).
  • Figure 2: Overview of the controllable self-reinforcing optimization cycle.
  • Figure 3: Comparison of pseudo-label error rate (a), pseudo-label utilization rate (b), and testing accuracy (c) among FreeMatch, SimPro, and our CPG under arbitrary unlabeled data distribution. The dataset is CIFAR-10-LT with $(N_{max}, M_{max}, \gamma_l, \gamma_u) = (400, 4600, 50, 50)$. The vertical gray dotted line indicates the initiation of pseudo-labeling in our method. Our CPG can generate pseudo-labels with a lower error rate and comparable utilization rate, achieving superior testing accuracy compared to both FreeMatch and SimPro.
  • Figure 4: Comparison of accuracy ($\%$) among FreeMatch, SimPro, and our CPG on CIFAR-10-LT and CIFAR-100-LT under arbitrary unlabeled data distribution with long-tailed labeled data distribution.
  • Figure 5: Evolution of pseudo-label predictions of our method under arbitrary unlabeled data distribution. GT denotes the ground-truth unlabeled data distribution. TP (FP) denotes the predicted true (false) positive pseudo-labels. KL denotes the Kullback-Leibler divergence between the predicted and ground-truth unlabeled data distributions. The dataset is CIFAR-10-LT with $(N_{max}, M_{max}, \gamma_l, \gamma_u) = (400, 4600, 50, 50)$. Our CPG can progressively increase the utilization rate of pseudo-labels while maintaining high accuracy during training, as the pseudo-label distribution gradually approximates the ground-truth unlabeled data distribution.
  • ...and 2 more figures
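The quantities reported in the captions above (pseudo-label error rate, utilization rate, per-class TP/FP counts, and the KL divergence between predicted and ground-truth class distributions) can be computed as in the following sketch. The function names are hypothetical, and the direction KL(predicted || ground truth) is an assumption, since the caption does not state which direction is used.

```python
import math

def pseudo_label_stats(pseudo, truth, num_classes):
    # pseudo[i] is the predicted class for unlabeled sample i,
    # or None if the sample was filtered out (not used).
    used = [(p, t) for p, t in zip(pseudo, truth) if p is not None]
    utilization = len(used) / len(truth)
    error_rate = sum(p != t for p, t in used) / len(used) if used else 0.0
    tp = [0] * num_classes  # correct pseudo-labels, per predicted class
    fp = [0] * num_classes  # wrong pseudo-labels, per predicted class
    for p, t in used:
        if p == t:
            tp[p] += 1
        else:
            fp[p] += 1
    return utilization, error_rate, tp, fp

def kl_divergence(pred_counts, true_counts, eps=1e-12):
    # KL(pred || true) between normalized class histograms.
    # Direction is an assumption; eps guards against empty true classes.
    p_total, q_total = sum(pred_counts), sum(true_counts)
    kl = 0.0
    for pc, qc in zip(pred_counts, true_counts):
        p = pc / p_total
        q = max(qc / q_total, eps)
        if p > 0:
            kl += p * math.log(p / q)
    return kl
```

A lower error rate at a comparable utilization rate, together with a KL value that shrinks during training, corresponds to the behavior the captions attribute to CPG.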

Theorems & Definitions (6)

  • Theorem 1: Generalization Error
  • proof
  • Lemma 1
  • proof
  • Lemma 2
  • proof