Improving realistic semi-supervised learning with doubly robust estimation

Khiem Pham, Charles Herrmann, Ramin Zabih

TL;DR

The paper tackles realistic long-tailed semi-supervised learning (RTSSL), where the unlabeled class distribution $P(Y|A=0)$ is unknown and differs from the labeled distribution. It introduces a two-stage approach: first estimate $P(Y|A=0)$ using a doubly robust estimator grounded in label-shift EM, then plug this estimate into existing pseudo-labeling methods to train the final classifier. The authors prove that the doubly robust estimator is asymptotically efficient under mild rate conditions, and they report empirical gains on CIFAR-10-LT, CIFAR-100-LT, STL-10, and ImageNet-127 under varied unlabeled distributions. The method is robust to imperfect first-stage estimates and integrates with established SSL frameworks such as SimPro and FixMatch, offering a practical path for realistic SSL in the presence of label shift.

Abstract

A major challenge in Semi-Supervised Learning (SSL) is the limited information available about the class distribution in the unlabeled data. In many real-world applications this arises from the prevalence of long-tailed distributions, where the standard pseudo-label approach to SSL is biased towards the labeled class distribution and thus performs poorly on unlabeled data. Existing methods typically assume that the unlabeled class distribution is either known a priori, which is unrealistic in most situations, or estimate it on-the-fly using the pseudo-labels themselves. We propose to explicitly estimate the unlabeled class distribution, which is a finite-dimensional parameter, as an initial step, using a doubly robust estimator with a strong theoretical guarantee; this estimate can then be integrated into existing methods to pseudo-label the unlabeled data during training more accurately. Experimental results demonstrate that incorporating our techniques into common pseudo-labeling approaches improves their performance.
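
To make the pseudo-label bias concrete, the following is a minimal NumPy sketch of prior correction via Bayes' rule, one common way an estimate of the unlabeled class distribution can be plugged into pseudo-labeling. The function name `adjust_probs`, the two-class priors, and the probabilities are illustrative assumptions and do not come from the paper, which works with logit adjustment inside existing SSL methods.

```python
import numpy as np

def adjust_probs(probs, labeled_prior, unlabeled_prior, eps=1e-12):
    """Re-weight posteriors learned under labeled_prior toward unlabeled_prior.

    probs: (n, K) array of p(y | x) from a classifier fit on labeled data.
    Returns an (n, K) array of re-normalized posteriors.
    """
    # Bayes' rule under label shift: p_u(y|x) is proportional to
    # p_l(y|x) * p_u(y) / p_l(y).
    weights = (unlabeled_prior + eps) / (labeled_prior + eps)
    adjusted = probs * weights
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Hypothetical setting: labeled data is head-heavy, unlabeled data is reversed.
labeled_prior = np.array([0.9, 0.1])
unlabeled_prior = np.array([0.1, 0.9])

# Classifier outputs on two unlabeled points (illustrative numbers).
probs = np.array([[0.6, 0.4],
                  [0.995, 0.005]])

naive_labels = probs.argmax(axis=1)
adjusted = adjust_probs(probs, labeled_prior, unlabeled_prior)
pseudo_labels = adjusted.argmax(axis=1)
```

In this toy setting the uncorrected pseudo-labels both fall on the head class, while the corrected ones move the uncertain first point to the tail class and leave the confident second point unchanged.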

Paper Structure

This paper contains 26 sections, 1 theorem, 29 equations, 2 figures, 8 tables.

Key Result

Theorem 3.2

Under Assumption \ref{assumption:4th-root-n}, the DR estimator $\Psi_{dr}$ is asymptotically normal with zero mean and variance equal to that of the efficient influence function.

Figures (2)

  • Figure 1: The labeled class distribution and 5 possible unlabeled class distributions studied in SimPro. SimPro significantly overestimates the head classes in the consistent, reverse and head-tail settings. Our doubly-robust estimate is more accurate at the head classes as well as over the whole distribution in all but the middle setting, as measured by total variation distance in \ref{tab:cifar10-tv}. Our proposed 2-stage SimPro+ outperforms SimPro in classification accuracy in the middle setting as well.
  • Figure 2: Overview of our 2-stage method (\ref{subsec:2-stage}). In stage 1, we use Expectation-Maximization (EM, \ref{subsec:em}) to estimate the missingness mechanism and classifier from observable data. These quantities are used as nuisance components for the doubly-robust estimator of the class distribution (\ref{eq:dr}). In stage 2, we can use EM or other existing methods that also use logit-adjustment with the (unlabeled) class distribution to estimate the final classifier. We use SimPro as our implementation of EM (\ref{subsec:simpro}). The network in stage 1 can be of equal or smaller size than the network in stage 2 (\ref{subsec:label}).
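
As a rough illustration of the EM idea mentioned in the Figure 2 caption, the sketch below runs the classical EM procedure for re-estimating class priors under label shift from fixed classifier posteriors. It is not the paper's doubly robust estimator and not SimPro: the function name `em_label_shift`, the fixed (non-retrained) classifier, and the toy two-class generative model are all assumptions made for the example.

```python
import numpy as np

def em_label_shift(probs, source_prior, n_iter=1000, tol=1e-8):
    """Estimate the class prior of unlabeled data under label shift.

    probs: (n, K) posteriors p(y | x) on unlabeled points, produced by a
           classifier calibrated for source_prior (the labeled prior).
    Returns a length-K estimate of the unlabeled class prior.
    """
    target = source_prior.copy()
    for _ in range(n_iter):
        # E-step: re-weight each posterior by the current prior ratio.
        weighted = probs * (target / source_prior)
        post = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: the new prior is the average adjusted posterior.
        new_target = post.mean(axis=0)
        converged = np.abs(new_target - target).sum() < tol
        target = new_target
        if converged:
            break
    return target

# Toy model (assumed): binary feature x, likelihoods p(x | y) per class.
lik = np.array([[0.9, 0.1],    # class 0: p(x=0), p(x=1)
                [0.2, 0.8]])   # class 1: p(x=0), p(x=1)
source_prior = np.array([0.5, 0.5])

# Exact source posteriors p(y | x), one row per value of x.
joint = lik.T * source_prior
posterior_by_x = joint / joint.sum(axis=1, keepdims=True)

# Unlabeled sample whose true prior is [0.2, 0.8]: 34 points with x=0
# and 66 with x=1, matching p(x=0) = 0.2*0.9 + 0.8*0.2 = 0.34.
x = np.array([0] * 34 + [1] * 66)
estimate = em_label_shift(posterior_by_x[x], source_prior)
```

With an exactly calibrated classifier, as in this toy model, the iteration recovers the shifted prior; in practice the quality of the estimate depends on how well the classifier's posteriors are calibrated.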

Theorems & Definitions (2)

  • Theorem 3.2
  • Proof of \ref{theorem:dr}