DivideMix: Learning with Noisy Labels as Semi-supervised Learning

Junnan Li, Richard Socher, Steven C. H. Hoi

TL;DR

DivideMix treats learning with noisy labels as a semi-supervised learning problem. A Gaussian Mixture Model fitted to the per-sample losses separates likely-clean samples (kept as labeled data) from likely-noisy samples (treated as unlabeled data). Two networks are trained simultaneously, each using the division produced by the other (co-divide), to avoid confirmation bias. During semi-supervised training, label co-refinement and label co-guessing extend MixMatch to the noisy-label setting. The paper reports substantial improvements over prior state-of-the-art methods on CIFAR-10/100, Clothing1M, and WebVision.

Abstract

Deep neural networks are known to be annotation-hungry. Numerous efforts have been devoted to reducing the annotation cost when learning with deep networks. Two prominent directions include learning with noisy labels and semi-supervised learning by exploiting unlabeled data. In this work, we propose DivideMix, a novel framework for learning with noisy labels by leveraging semi-supervised learning techniques. In particular, DivideMix models the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples, and trains the model on both the labeled and unlabeled data in a semi-supervised manner. To avoid confirmation bias, we simultaneously train two diverged networks where each network uses the dataset division from the other network. During the semi-supervised training phase, we improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively. Experiments on multiple benchmark datasets demonstrate substantial improvements over state-of-the-art methods. Code is available at https://github.com/LiJunnan1992/DivideMix .
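The division step described in the abstract can be illustrated with a minimal sketch. It fits a two-component, one-dimensional Gaussian mixture to per-sample losses with EM and treats the posterior of the lower-mean component as a "clean probability". The function names (`clean_probabilities`, `divide`), the threshold `tau=0.5`, the min-max normalization, and the toy loss values are assumptions made for illustration; they are not taken from the released code, which may differ in detail.

```python
import math


def clean_probabilities(losses, iters=50, var_floor=1e-6):
    """Fit a 1-D two-component GMM by EM and return, per sample,
    the posterior probability of the component with the smaller mean."""
    lo, hi = min(losses), max(losses)
    span = (hi - lo) or 1.0
    x = [(v - lo) / span for v in losses]  # min-max normalize to [0, 1]

    # Illustrative initialization: one component at each end of the range.
    mu, var, weight = [0.0, 1.0], [0.05, 0.05], [0.5, 0.5]

    def responsibilities():
        out = []
        for v in x:
            # Log-domain densities avoid underflow for tight components.
            logp = [
                math.log(weight[k])
                - 0.5 * math.log(2.0 * math.pi * var[k])
                - (v - mu[k]) ** 2 / (2.0 * var[k])
                for k in range(2)
            ]
            top = max(logp)
            e = [math.exp(lp - top) for lp in logp]
            z = sum(e)
            out.append([e[0] / z, e[1] / z])
        return out

    for _ in range(iters):
        resp = responsibilities()  # E-step
        for k in range(2):  # M-step
            nk = sum(r[k] for r in resp) + 1e-12
            mu[k] = sum(r[k] * v for r, v in zip(resp, x)) / nk
            var[k] = sum(r[k] * (v - mu[k]) ** 2 for r, v in zip(resp, x)) / nk + var_floor
            weight[k] = nk / len(x)

    resp = responsibilities()
    low = 0 if mu[0] <= mu[1] else 1  # low-loss component = "clean"
    return [r[low] for r in resp]


def divide(losses, tau=0.5):
    """Split sample indices into a labeled (likely clean) and an unlabeled (likely noisy) set."""
    probs = clean_probabilities(losses)
    labeled = [i for i, p in enumerate(probs) if p > tau]
    unlabeled = [i for i, p in enumerate(probs) if p <= tau]
    return labeled, unlabeled, probs


# Toy per-sample losses: five small (likely clean) and four large (likely noisy).
losses = [0.05, 0.10, 0.08, 0.12, 0.07, 2.1, 1.9, 2.3, 2.0]
labeled, unlabeled, probs = divide(losses)
print(labeled, unlabeled)
```

In the co-divide setting, the division computed from network A's losses would be handed to network B as its training split, and vice versa, so that neither network filters its own training data.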

Paper Structure

This paper contains 16 sections, 8 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 1: DivideMix trains two networks (A and B) simultaneously. At each epoch, a network models its per-sample loss distribution with a GMM to divide the dataset into a labeled set (mostly clean) and an unlabeled set (mostly noisy), which is then used as training data for the other network (i.e. co-divide). At each mini-batch, a network performs semi-supervised training using an improved MixMatch method. We perform label co-refinement on the labeled samples and label co-guessing on the unlabeled samples.
  • Figure 2: Training on CIFAR-10 with 40% asymmetric noise after a 10-epoch warm-up. (a) Standard training with cross-entropy loss causes the model to overfit and produce over-confident predictions, making the per-sample loss $\ell$ difficult for the GMM to model. (b) Adding a confidence penalty (negative entropy) during warm-up leads to a more evenly distributed $\ell$. (c) Training with DivideMix effectively reduces the loss for clean samples while keeping the loss larger for most noisy samples.
  • Figure 3: Area under the curve (AUC) for clean/noisy image classification on CIFAR-10 training samples. Our method can effectively filter out the noisy samples and leverage them as unlabeled data.
  • Figure 4: Clothing1M images identified as noisy samples by our method. Ground-truth labels are shown above in red and the co-guessed labels are shown below in blue.
  • Figure 5: t-SNE of training images after training the model using DivideMix for 200 epochs on CIFAR-10 with 80% label noise. Different colors indicate (a) noisy training labels or (b) true labels. DivideMix is able to learn the true class distribution of the training data despite the label noise.
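
The label co-refinement and label co-guessing steps named in the Figure 1 caption can be sketched as follows. This is a simplified illustration: the function names, the sharpening temperature of 0.5, and the toy probability vectors are assumptions, and the averaging of predictions over several augmentations of each image is omitted for brevity.

```python
def sharpen(probs, temperature=0.5):
    """Lower the entropy of a probability vector by temperature sharpening."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def co_refine(given_label, own_pred, clean_prob, temperature=0.5):
    """Labeled samples: blend the given one-hot label with the network's own
    prediction, weighted by the clean probability supplied by the other network."""
    mixed = [
        clean_prob * y + (1.0 - clean_prob) * p
        for y, p in zip(given_label, own_pred)
    ]
    return sharpen(mixed, temperature)


def co_guess(pred_a, pred_b, temperature=0.5):
    """Unlabeled samples: average the predictions of both networks, then sharpen."""
    averaged = [(a + b) / 2.0 for a, b in zip(pred_a, pred_b)]
    return sharpen(averaged, temperature)


# A sample whose given label (class 1) is judged unlikely to be clean:
# the refined target moves toward the network's prediction (class 0).
refined = co_refine(given_label=[0.0, 1.0, 0.0],
                    own_pred=[0.7, 0.2, 0.1],
                    clean_prob=0.2)

# An unlabeled sample: both networks contribute to the guessed target.
guessed = co_guess(pred_a=[0.6, 0.3, 0.1], pred_b=[0.4, 0.5, 0.1])
print(refined, guessed)
```

When `clean_prob` is close to 1 the refined target stays at the given label, and when it is close to 0 the target is driven almost entirely by the network's prediction, which is the intended behavior for samples the mixture model flags as probably mislabeled.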