Scaling Up Semi-supervised Learning with Unconstrained Unlabelled Data

Shuvendu Roy; Ali Etemad

TL;DR

UnMixMatch tackles semi-supervised learning (SSL) with unconstrained unlabelled data by decoupling representation learning from class supervision. It combines a supervised module with hard augmentations (RandMixUp), a contrastive consistency regularizer based on InfoNCE, and a rotation-based self-supervised task, unified under the total loss $\mathcal{L}_{UnMixMatch} = \mathcal{L}_{sup} + \beta \mathcal{L}_{con} + \gamma\mathcal{L}_{rot}$, where $\beta$ and $\gamma$ weight the contrastive and rotation terms. Empirically, it achieves a 4.79% average improvement over prior SSL methods on CIFAR-10/100, SVHN, and STL-10 when the unlabelled data is unconstrained, scales well with larger unlabelled sets, and performs strongly in open-set and barely supervised scenarios. The approach sets a new state of the art for open-set SSL and makes it practical for SSL to leverage web-scale unlabelled data; code is released for reproducibility.

Abstract

We propose UnMixMatch, a semi-supervised learning framework that learns effective representations from unconstrained unlabelled data in order to scale up performance. Most existing semi-supervised methods assume that labelled and unlabelled samples are drawn from the same distribution, which limits the potential for improvement through the use of free-living unlabelled data and, in turn, hinders the generalizability and scalability of semi-supervised learning. Our method aims to overcome these constraints and effectively utilize unconstrained unlabelled data in semi-supervised learning. UnMixMatch consists of three main components: a supervised learner with hard augmentations that provides strong regularization, a contrastive consistency regularizer that learns underlying representations from the unlabelled data, and a self-supervised loss that enhances the representations learnt from the unlabelled data. We perform extensive experiments on four commonly used datasets and demonstrate superior performance over existing semi-supervised methods, with a performance boost of 4.79%. Extensive ablation and sensitivity studies show the effectiveness and impact of each proposed component of our method.
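To make the three-part objective concrete, the following is a minimal sketch of how the supervised, contrastive, and rotation terms might be combined into $\mathcal{L}_{sup} + \beta \mathcal{L}_{con} + \gamma\mathcal{L}_{rot}$. It is not the authors' implementation. The function names, the temperature, the default values of `beta` and `gamma`, the choice of in-batch negatives for InfoNCE, and the four-way rotation labels (0, 90, 180, 270 degrees) are all assumptions made for illustration; this summary does not specify them.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.5):
    """InfoNCE over a batch (assumed form): positives[i] is the positive
    for anchors[i]; every other entry of `positives` acts as a negative."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        losses.append(cross_entropy(logits, i))
    return sum(losses) / len(losses)

def unmixmatch_total(l_sup, l_con, l_rot, beta=1.0, gamma=1.0):
    """Weighted sum of the three terms; beta and gamma are placeholders."""
    return l_sup + beta * l_con + gamma * l_rot

# Toy usage with hand-written numbers (hypothetical, not real model outputs).
l_sup = cross_entropy([2.0, 0.1, -1.0], target=0)          # labelled sample
l_con = info_nce([[1.0, 0.0], [0.0, 1.0]],                  # view 1 embeddings
                 [[0.9, 0.1], [0.2, 0.8]])                  # view 2 embeddings
l_rot = cross_entropy([0.3, 1.5, 0.0, -0.2], target=1)      # 4 rotation classes
total = unmixmatch_total(l_sup, l_con, l_rot, beta=1.0, gamma=0.5)
```

In this sketch only the supervised term uses class labels; the contrastive and rotation terms need nothing beyond the unlabelled images themselves, which is why they can be applied to unlabelled data drawn from a different distribution than the labelled set.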

Paper Structure (25 sections, 7 equations, 3 figures, 8 tables)

Introduction
Related Work
Constrained Semi-supervised Learning
Open Set Semi-supervised Learning
Method
Preliminaries and Overview
Supervised Module
Consistency Regularization Module
Self-supervised Module
Total Loss
Experiments and Results
Datasets and Implementation Details
Results
Unconstrained Settings.
Scaling Up the Unlabelled Set.
...and 10 more sections

Figures (3)

Figure 1: Learning from unconstrained unlabelled data (left). Scaling up unlabelled data provides a large improvement for UnMixMatch (right).
Figure 2: Overview of our proposed method.
Figure 3: Sensitivity study on important hyper-parameters.
