Vision-Language Models are Strong Noisy Label Detectors

Tong Wei, Hao-Tian Li, Chun-Shu Li, Jiang-Xin Shi, Yu-Feng Li, Min-Ling Zhang

TL;DR

The proposed framework establishes a noisy label detector by learning positive and negative textual prompts for each class, and employs parameter-efficient fine-tuning for the adaptation of a pre-trained visual encoder to promote its alignment with the learned textual prompts.

Abstract

Recent research on fine-tuning vision-language models has demonstrated impressive performance in various downstream tasks. However, the challenge of obtaining accurately labeled data in real-world applications poses a significant obstacle during the fine-tuning process. To address this challenge, this paper presents a Denoising Fine-Tuning framework, called DeFT, for adapting vision-language models. DeFT utilizes the robust alignment of textual and visual features pre-trained on millions of auxiliary image-text pairs to sieve out noisy labels. The proposed framework establishes a noisy label detector by learning positive and negative textual prompts for each class. The positive prompt seeks to reveal distinctive features of the class, while the negative prompt serves as a learnable threshold for separating clean and noisy samples. We employ parameter-efficient fine-tuning for the adaptation of a pre-trained visual encoder to promote its alignment with the learned textual prompts. As a general framework, DeFT can seamlessly fine-tune many pre-trained models to downstream tasks by utilizing carefully selected clean samples. Experimental results on seven synthetic and real-world noisy datasets validate the effectiveness of DeFT in both noisy label detection and image classification.
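The idea of a negative prompt acting as a per-class threshold can be illustrated with a small sketch. The abstract does not give the exact selection rule, so the code below rests on an assumption: a sample is kept as clean when its image feature is more similar (by cosine similarity) to the positive prompt embedding of its given label than to that label's negative prompt embedding. The function names `cosine` and `select_clean` and the toy 2-D embeddings are hypothetical, chosen only for illustration; in practice the features would come from the pre-trained visual and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_clean(image_feats, labels, pos_prompts, neg_prompts):
    """Return indices of samples judged clean.

    Assumed rule (not taken from the paper's equations): sample i with
    label y is clean if sim(image_i, pos_prompts[y]) exceeds
    sim(image_i, neg_prompts[y]), so the negative prompt plays the role
    of a class-specific threshold.
    """
    clean = []
    for i, (feat, y) in enumerate(zip(image_feats, labels)):
        if cosine(feat, pos_prompts[y]) > cosine(feat, neg_prompts[y]):
            clean.append(i)
    return clean

# Toy embeddings: class 0 points along x, class 1 along y.
pos_prompts = {0: [1.0, 0.0], 1: [0.0, 1.0]}
neg_prompts = {0: [1.0, 1.0], 1: [1.0, 1.0]}

image_feats = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]
labels = [0, 1, 1]  # the third label disagrees with its image feature

clean_idx = select_clean(image_feats, labels, pos_prompts, neg_prompts)
print(clean_idx)
```

In this toy setup the third sample is flagged as noisy because its feature aligns poorly with the positive prompt of its assigned class. Because the threshold is a prompt embedding rather than a fixed scalar, it could be tuned per class during training, which is the role the abstract assigns to the negative prompt.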

Paper Structure (44 sections, 8 equations, 4 figures, 11 tables, 1 algorithm)

Figures (4)

  • Figure 1: Comparison of different fine-tuning methods under (a) various ratios of noisy labels and (b) clean datasets.
  • Figure 2: Illustration of the proposed DeFT framework. Left: Noisy labels are identified with learnable dual textual prompts, and image-text alignment is improved by optimizing parameter-efficient fine-tuning (PEFT) modules. Right: Pre-trained models are adapted using full fine-tuning (FFT) on the selected clean samples.
  • Figure 3: Ablation studies. We report the test accuracy across varying noise ratios for the following variants: 1) w/o adap.: DeFT without the model adaptation phase, 2) PEFT: DeFT using PEFT for the model adaptation phase, and 3) FFT: DeFT using FFT for the model adaptation phase.
  • Figure 4: Comparison of different parameter-efficient fine-tuning techniques on Tiny-ImageNet with various ratios of noisy labels.