SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed Semi-Supervised Learning

Chaoqun Du, Yizeng Han, Gao Huang

TL;DR

SimPro tackles realistic long-tailed semi-supervised learning (ReaLTSSL), where the unlabeled data may follow an unknown class distribution, possibly mismatched with that of the labeled data. It casts SSL in an EM-style framework that explicitly decouples the modeling of the conditional P(x|y) and the marginal P(y). This yields a closed-form update for the class priors π, while the network parameters θ are learned by gradient descent; together they define a Bayes classifier that generates pseudo-labels. The method comes with theoretical guarantees, is simple to implement, and extends evaluation to two novel unlabeled class distributions, showing consistent state-of-the-art results across CIFAR-10/100-LT, STL10-LT, and ImageNet variants. The code is publicly available.

Abstract

Recent advancements in semi-supervised learning have focused on a more realistic yet challenging task: addressing imbalances in labeled data while the class distribution of unlabeled data remains both unknown and potentially mismatched. Current approaches in this sphere often presuppose rigid assumptions regarding the class distribution of unlabeled data, thereby limiting the adaptability of models to only certain distribution ranges. In this study, we propose a novel approach, introducing a highly adaptable framework, designated as SimPro, which does not rely on any predefined assumptions about the distribution of unlabeled data. Our framework, grounded in a probabilistic model, innovatively refines the expectation-maximization (EM) algorithm by explicitly decoupling the modeling of conditional and marginal class distributions. This separation facilitates a closed-form solution for class distribution estimation during the maximization phase, leading to the formulation of a Bayes classifier. The Bayes classifier, in turn, enhances the quality of pseudo-labels in the expectation phase. Remarkably, the SimPro framework not only comes with theoretical guarantees but also is straightforward to implement. Moreover, we introduce two novel class distributions broadening the scope of the evaluation. Our method showcases consistent state-of-the-art performance across diverse benchmarks and data distribution scenarios. Our code is available at https://github.com/LeapLabTHU/SimPro.
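
The alternation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). It assumes the network output for class y acts as a class-conditional score, so that the posterior is proportional to π_y · exp(score), and it assumes the prior update is the average of soft pseudo-labels over unlabeled samples. The function names `bayes_posterior`, `update_prior`, and `estimate_prior` are hypothetical, and the scores are held fixed here, whereas in the actual method θ is trained concurrently.

```python
import numpy as np

def bayes_posterior(scores, prior):
    """Posterior P(y|x), assumed proportional to prior[y] * exp(scores[x, y])."""
    logits = scores + np.log(prior)              # add the log-prior to class scores
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def update_prior(posteriors):
    """Assumed closed-form style update: per-class mean of soft pseudo-labels."""
    return posteriors.mean(axis=0)

def estimate_prior(scores, num_iters=50):
    """Alternate the E-step and the prior update with the scores held fixed."""
    num_classes = scores.shape[1]
    prior = np.full(num_classes, 1.0 / num_classes)  # start from a uniform prior
    for _ in range(num_iters):
        posteriors = bayes_posterior(scores, prior)  # E-step: soft pseudo-labels
        prior = update_prior(posteriors)             # M-step for the prior only
    return prior

if __name__ == "__main__":
    # Toy scores: three samples favor class 0, one favors class 1.
    toy_scores = np.array([[10.0, 0.0], [10.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    print(estimate_prior(toy_scores))
```

Starting from a uniform prior means the sketch encodes no assumption about the unlabeled class distribution; the estimate is driven only by the pseudo-labels.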

Paper Structure (34 sections, 3 theorems, 56 equations, 6 figures, 9 tables, 1 algorithm)

Key Result

Proposition 1

The optimal $\bm{\hat{\pi}}$ that maximizes $\mathcal{Q}(\bm{\theta},\bm{\pi};\bm{\theta}',\bm{\pi}')$ admits a closed-form solution.
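
A brief sketch of why a prior update of this kind can be closed-form, under an assumption that is not stated in this summary: suppose the only part of $\mathcal{Q}$ that depends on $\bm{\pi}$ has the form $\sum_y c_y \log \pi_y$, where $c_y \ge 0$ is a hypothetical per-class weight (for example, a sum of soft pseudo-labels for class $y$). Maximizing over the probability simplex with a Lagrange multiplier $\lambda$ then gives:

```latex
\begin{aligned}
\mathcal{L}(\bm{\pi},\lambda) &= \sum_{y} c_y \log \pi_y + \lambda\Bigl(1 - \sum_{y}\pi_y\Bigr), \\
\frac{\partial \mathcal{L}}{\partial \pi_y} &= \frac{c_y}{\pi_y} - \lambda = 0
\;\Longrightarrow\; \pi_y = \frac{c_y}{\lambda}, \\
\sum_{y}\pi_y = 1 &\;\Longrightarrow\; \lambda = \sum_{y'} c_{y'}
\;\Longrightarrow\; \hat{\pi}_y = \frac{c_y}{\sum_{y'} c_{y'}}.
\end{aligned}
```

Under that assumed form, the update is a normalization of per-class weights, so no gradient steps are needed for $\bm{\pi}$.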

Figures (6)

  • Figure 1: The general idea of SimPro addressing the ReaLTSSL problem. (a) Current methods typically rely on predefined or assumed class distribution patterns for unlabeled data, limiting their applicability. (b) In contrast, our SimPro embraces a more realistic scenario by introducing a simple and elegant framework that operates effectively without making any assumptions about the distribution of unlabeled data. This paradigm shift allows for greater flexibility and applicability in diverse ReaLTSSL scenarios.
  • Figure 2: The SimPro Framework Overview. This framework distinctively separates the conditional and marginal (class) distributions. In the E-step (top), pseudo-labels are generated using the current parameters $\bm{\theta}$ and $\bm{\pi}$. In the subsequent M-step (bottom), these pseudo-labels, along with the ground-truth labels, are utilized to compute the Cross-Entropy loss (the paper's overall loss equation), facilitating the optimization of network parameters $\bm{\theta}$ via gradient descent. Concurrently, the marginal distribution parameter $\bm{\pi}$ is recalculated using a closed-form solution based on the generated pseudo-labels (the paper's closed-form equation for $\bm{\pi}$).
  • Figure 3: Test performance under additional imbalance ratios on CIFAR10-LT with $\gamma_l=150$, $N_1 = 500$, and $M_1 = 4000$.
  • Figure 4: Sensitivity analysis of the confidence threshold $t$ on CIFAR100-LT with $\gamma_l=20$, $N_1\!=\!50$, and $M_1\!=\!400$ and CIFAR10-LT with $\gamma_l\!=\!150$, $N_1\!=\!500$, and $M_1\!=\!4000$. The optimal performance is consistently achieved across different settings when the threshold is set at $t=0.2$ and $0.95$, respectively.
  • Figure 5: Visualization of the quality of the estimated distribution on CIFAR10-LT with $\gamma_l\!=\!150$, $N_1\!=\!500$, and $M_1\!=\!4000$. The KL distances reduce to near-zero values after very few epochs.
  • ...and 1 more figure
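
The quantities named in the captions above (the imbalance ratio $\gamma_l$, the head-class size $N_1$, the confidence threshold $t$, and the KL distance between estimated and true class distributions) can be made concrete with a short NumPy sketch. It rests on assumptions not confirmed by this summary: class sizes follow the common long-tailed convention $N_k = N_1 \gamma^{-(k-1)/(K-1)}$, pseudo-labels are kept when the maximum posterior reaches $t$, and the KL distance is the standard KL divergence. The function names are hypothetical.

```python
import numpy as np

def long_tailed_counts(n_head, gamma, num_classes):
    """Class sizes decaying exponentially from n_head down to about n_head / gamma."""
    k = np.arange(num_classes)
    return np.floor(n_head * gamma ** (-k / (num_classes - 1))).astype(int)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two class distributions; eps guards against log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def select_confident(posteriors, t):
    """Return hard pseudo-labels and a mask of samples with confidence >= t."""
    confidence = posteriors.max(axis=1)
    return posteriors.argmax(axis=1), confidence >= t

if __name__ == "__main__":
    # Labeled split matching the CIFAR10-LT setting in the captions.
    counts = long_tailed_counts(n_head=500, gamma=150, num_classes=10)
    true_dist = counts / counts.sum()
    uniform = np.full(10, 0.1)
    print(counts)
    print(kl_divergence(true_dist, uniform))  # gap a uniform estimate would leave

    posteriors = np.array([[0.97, 0.03], [0.60, 0.40]])
    print(select_confident(posteriors, t=0.95))
```

A lower threshold such as $t=0.2$ admits nearly every pseudo-label, while $t=0.95$ keeps only highly confident ones, which is the trade-off the sensitivity analysis in Figure 4 explores.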

Theorems & Definitions (6)

  • Proposition 1: Closed-form Solution for $\bm{\pi}$
  • Proposition 2: Bayes Classifier
  • Proposition 3: Regret Bound
  • proof
  • proof
  • proof