CLIPCleaner: Cleaning Noisy Labels with CLIP

Chen Feng, Georgios Tzimiropoulos, Ioannis Patras

TL;DR

This paper proposes a method that uses CLIP, a powerful vision-language model, to construct a zero-shot classifier for efficient, offline selection of clean samples. It provides theoretical justification and empirical evidence for the advantages of CLIP over conventional pre-trained models in learning with noisy labels (LNL).

Abstract

Learning with Noisy Labels (LNL) poses a significant challenge for the machine learning community. Some of the most widely used approaches select as clean those samples for which the model itself (the in-training model) has high confidence, e.g., the 'small loss' criterion, and can suffer from the so-called 'self-confirmation' bias. This bias arises because the in-training model is at least partially trained on the noisy labels. Furthermore, in the classification case, an additional challenge arises because some of the label noise occurs between classes that are visually very similar ('hard noise'). This paper addresses these challenges by proposing a method (CLIPCleaner) that leverages CLIP, a powerful Vision-Language (VL) model, to construct a zero-shot classifier for efficient, offline, clean sample selection. This has two advantages: the sample selection is decoupled from the in-training model, and the sample selection is aware of the semantic and visual similarities between the classes because of the way CLIP is trained. We provide theoretical justifications and empirical evidence to demonstrate the advantages of CLIP for LNL compared to conventional pre-trained models. Compared to current methods that combine iterative sample selection with various techniques, CLIPCleaner offers a simple, single-step approach that achieves competitive or superior performance on benchmark datasets. To the best of our knowledge, this is the first time a VL model has been used for sample selection to address the problem of LNL, highlighting the potential of such models in this domain.
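The idea of offline selection with a zero-shot classifier can be illustrated with a minimal sketch. It assumes that image embeddings and per-class text embeddings have already been computed by a CLIP-like model; the toy vectors, the function names, the temperature, and the fixed probability threshold are all hypothetical choices made for illustration and are not taken from the paper, whose actual selection criterion may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_emb, class_text_embs, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities to each class prompt.

    The temperature value is an assumption for this sketch.
    """
    logits = [cosine(image_emb, t) / temperature for t in class_text_embs]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_clean(image_embs, noisy_labels, class_text_embs, threshold=0.5):
    """Keep indices whose annotated label receives at least `threshold` probability.

    The thresholding rule is a hypothetical stand-in for a selection criterion.
    """
    selected = []
    for i, (emb, label) in enumerate(zip(image_embs, noisy_labels)):
        probs = zero_shot_probs(emb, class_text_embs)
        if probs[label] >= threshold:
            selected.append(i)
    return selected

# Toy data: two classes, three samples; the third label disagrees with its embedding.
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.95, 0.05]]
noisy_labels = [0, 1, 1]

clean_idx = select_clean(image_embs, noisy_labels, class_text_embs)
print(clean_idx)
```

Because the selection runs once over fixed embeddings and never consults the in-training model, it cannot reinforce that model's own mistakes, which is the decoupling the abstract describes.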

Paper Structure (39 sections, 5 theorems, 40 equations, 4 figures, 13 tables)

Key Result

Theorem 1

Let $\mathcal{G}$ and $\mathcal{H}$ be the hypothesis spaces of the vision encoder $g$ and the language encoder $h$, and let $\mathfrak{R}(\mathcal{G}\circ\mathcal{H})$ denote the Rademacher complexity of the combined CLIP model. Suppose the range of $L$ from eq:clip_loss is $[0, l^{clip}_{\infty}]$ for all [...] with $\lambda_0, \lambda_1, \lambda_2, \lambda_3 > 0$. Here, $\varepsilon_{domain}$ denotes the bia[...]

Figures (4)

  • Figure 1: Workflow of CLIPCleaner. We highlight the sections corresponding to the two main steps of CLIPCleaner, and particularly visualize the intuition of the probability estimation step based on the CLIP zero-shot classifier.
  • Figure 1: Ablations on MixFix with synthetic CIFAR100 noisy dataset. The top-3 results are bolded.
  • Figure 2: $N_{train}$ denotes number of training samples, $N_{clean}$ denotes number of clean training samples and $N_{all}$ denotes number of clean training samples.
  • Figure 3: Comparison of various sample selection methods across datasets, noise types, and noise ratios. We report the ROC AUC score for the binary identification of clean samples.
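The ROC AUC metric mentioned in the last caption can be sketched as follows. This is a generic pairwise formulation of ROC AUC, not the paper's evaluation code; the scores and ground-truth flags are made-up toy values, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(scores, is_clean):
    """ROC AUC as the probability that a clean sample outscores a noisy one.

    Ties between a clean and a noisy sample count as half a win.
    """
    clean_scores = [s for s, c in zip(scores, is_clean) if c]
    noisy_scores = [s for s, c in zip(scores, is_clean) if not c]
    if not clean_scores or not noisy_scores:
        raise ValueError("need at least one clean and one noisy sample")
    wins = 0.0
    for p in clean_scores:
        for n in noisy_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(clean_scores) * len(noisy_scores))

# Toy example: a selection method assigns each sample a "cleanness" score.
scores = [0.9, 0.4, 0.5, 0.8, 0.1]
is_clean = [True, True, False, True, False]

auc = roc_auc(scores, is_clean)
print(round(auc, 4))
```

A score of 1.0 means every clean sample is ranked above every noisy one, while 0.5 corresponds to a random ranking, so the metric compares selection methods without fixing a selection threshold.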

Theorems & Definitions (5)

  • Theorem 1: Estimation with zero-shot classifier
  • Theorem 2: Estimation with induced classifier
  • Lemma 1: Rademacher generalization error bound [mohri2018foundationmodel]
  • Theorem 3: Estimation with zero-shot classifier
  • Theorem 4: Estimation with induced classifier
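
For orientation, the textbook form of a Rademacher generalization error bound is given below. This is the standard statement and not necessarily the exact form of the lemma listed above. The notation is introduced here for illustration: $\mathcal{F}$ is a hypothesis class, $\ell$ is a loss with range $[0, M]$, $z_1, \dots, z_n$ are i.i.d. samples, and $\mathfrak{R}_n(\ell\circ\mathcal{F})$ is the Rademacher complexity of the loss class. With probability at least $1-\delta$, for all $f \in \mathcal{F}$ simultaneously:

$$
\mathbb{E}_{z}\big[\ell(f, z)\big] \;\le\; \frac{1}{n}\sum_{i=1}^{n} \ell(f, z_i) \;+\; 2\,\mathfrak{R}_n(\ell\circ\mathcal{F}) \;+\; M\sqrt{\frac{\log(1/\delta)}{2n}}
$$

The first term is the empirical risk, the second measures the capacity of the hypothesis class, and the third shrinks at rate $1/\sqrt{n}$. This explains why a bounded loss range such as $[0, l^{clip}_{\infty}]$ and a complexity term such as $\mathfrak{R}(\mathcal{G}\circ\mathcal{H})$ appear in the statement of Theorem 1.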