Learning from Noisy Labels with Contrastive Co-Transformer

Yan Han, Soumava Kumar Roy, Mehrtash Harandi, Lars Petersson

TL;DR

This work addresses learning from noisy image labels by integrating a contrastive loss into a Co-Training framework with two transformer encoders (CCT). It leverages all mini-batch samples through an unsupervised contrastive objective $L_{con}$ alongside the supervised classification loss $L_{ce}$, giving one total loss per encoder: $L_1 = L_{ce}^1 + \lambda L_{con}$ and $L_2 = L_{ce}^2 + \lambda L_{con}$, with $\lambda = 0.0001$. The approach demonstrates strong empirical performance across six datasets, including Clothing1M, often surpassing state-of-the-art noisy-label methods while using fewer parameters and less computation. This indicates that transformers can be more robust to label noise when combined with contrastive learning in a Co-Training setup, which is of practical value for real-world noisy data.
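
The way the two total losses are assembled can be illustrated with a minimal pure-Python sketch. Only the combination $L_k = L_{ce}^k + \lambda L_{con}$ and the value $\lambda = 0.0001$ come from the summary above. The exact form of $L_{con}$ is not given here, so the sketch assumes an InfoNCE-style loss in which the two encoders' features for the same image form a positive pair; the function names, the temperature value, and the toy inputs are all hypothetical. Any sample selection that the Co-Training scheme applies to the supervised term is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean cross-entropy over a mini-batch (the supervised term L_ce)."""
    loss = 0.0
    for logits, y in zip(batch_logits, labels):
        loss -= math.log(softmax(logits)[y])
    return loss / len(labels)


def l2_normalize(v):
    """Scale a feature vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(h1, h2, temperature=0.5):
    """ASSUMED InfoNCE-style loss, not necessarily the paper's L_con.

    h1[i] and h2[i] are the two encoders' features for the same image and
    are treated as a positive pair; the other samples in the mini-batch act
    as negatives. No labels are used, so noisy samples contribute too.
    """
    z1 = [l2_normalize(v) for v in h1]
    z2 = [l2_normalize(v) for v in h2]
    loss = 0.0
    for i, a in enumerate(z1):
        sims = [sum(x * y for x, y in zip(a, b)) / temperature for b in z2]
        loss -= math.log(softmax(sims)[i])
    return loss / len(z1)


def cct_losses(logits1, logits2, h1, h2, labels, lam=1e-4):
    """Return (L_1, L_2, L_con) with L_k = L_ce^k + lam * L_con."""
    l_con = contrastive_loss(h1, h2)
    l1 = cross_entropy(logits1, labels) + lam * l_con
    l2 = cross_entropy(logits2, labels) + lam * l_con
    return l1, l2, l_con


# Toy mini-batch: two samples, two classes, 2-D features (hypothetical values).
labels = [0, 1]
logits1 = [[2.0, 0.0], [0.0, 2.0]]
logits2 = [[1.0, 0.5], [0.2, 1.5]]
h1 = [[1.0, 0.0], [0.0, 1.0]]
h2 = [[0.9, 0.1], [0.2, 0.8]]
loss1, loss2, loss_con = cct_losses(logits1, logits2, h1, h2, labels)
```

With $\lambda$ this small, the contrastive term shifts each total loss only slightly, so in this sketch it acts as a light regularizer on the shared feature space rather than as the dominant training signal.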

Abstract

Deep learning with noisy labels is an interesting challenge in weakly supervised learning. Despite their significant learning capacity, CNNs tend to overfit in the presence of samples with noisy labels. To alleviate this issue, we use the well-known Co-Training framework as the basis for our work. In this paper, we introduce a Contrastive Co-Transformer framework, which is simple and fast, yet improves performance by a large margin over state-of-the-art approaches. We argue that transformers are robust when dealing with label noise. Our Contrastive Co-Transformer approach is able to utilize all samples in the dataset, irrespective of whether they are clean or noisy. The transformers are trained with a combination of a contrastive loss and a classification loss. Extensive experimental results on corrupted data from six standard benchmark datasets, including Clothing1M, demonstrate that our Contrastive Co-Transformer is superior to existing state-of-the-art methods.

Paper Structure

This paper contains 11 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Test accuracy of plain CNN and transformer-based architectures with $45\%$ pairflip noise on CIFAR10.
  • Figure 2: Schematic diagram of CCT. The noisy dataset is fed into two transformers in parallel. Features (${\boldsymbol{h}}_1$ and ${\boldsymbol{h}}_2$) are extracted before the linear layers. Simultaneously, both ${\boldsymbol{h}}_1$ and ${\boldsymbol{h}}_2$ are passed through classifiers to produce ${\boldsymbol{p}}_1$ and ${\boldsymbol{p}}_2$, which are used to calculate the classification loss.