Vision-Language Models are Strong Noisy Label Detectors

Tong Wei; Hao-Tian Li; Chun-Shu Li; Jiang-Xin Shi; Yu-Feng Li; Min-Ling Zhang

TL;DR

The proposed framework establishes a noisy label detector by learning positive and negative textual prompts for each class, and employs parameter-efficient fine-tuning for the adaptation of a pre-trained visual encoder to promote its alignment with the learned textual prompts.

Abstract

Recent research on fine-tuning vision-language models has demonstrated impressive performance in various downstream tasks. However, the challenge of obtaining accurately labeled data in real-world applications poses a significant obstacle during the fine-tuning process. To address this challenge, this paper presents a Denoising Fine-Tuning framework, called DeFT, for adapting vision-language models. DeFT utilizes the robust alignment of textual and visual features pre-trained on millions of auxiliary image-text pairs to sieve out noisy labels. The proposed framework establishes a noisy label detector by learning positive and negative textual prompts for each class. The positive prompt seeks to reveal distinctive features of the class, while the negative prompt serves as a learnable threshold for separating clean and noisy samples. We employ parameter-efficient fine-tuning for the adaptation of a pre-trained visual encoder to promote its alignment with the learned textual prompts. As a general framework, DeFT can seamlessly fine-tune many pre-trained models to downstream tasks by utilizing carefully selected clean samples. Experimental results on seven synthetic and real-world noisy datasets validate the effectiveness of DeFT in both noisy label detection and image classification.
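
The dual-prompt idea described in the abstract can be illustrated with a minimal sketch. It assumes, as a simplification not spelled out in this summary, that a sample is kept as clean when its image feature is more similar to the positive prompt of its given label than to the negative prompt of that label. The function names (`cosine`, `detect_clean`) and the toy feature vectors are hypothetical; in practice the features would come from a pre-trained vision-language model and the prompts would be learned.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def detect_clean(image_feats, labels, pos_prompts, neg_prompts):
    """Return one boolean per sample: True if the given label is judged clean.

    Assumed decision rule (illustrative only): the negative prompt of the
    labeled class acts as a per-class threshold, so a sample counts as clean
    when sim(image, positive prompt) > sim(image, negative prompt).
    """
    mask = []
    for feat, y in zip(image_feats, labels):
        sim_pos = cosine(feat, pos_prompts[y])
        sim_neg = cosine(feat, neg_prompts[y])
        mask.append(sim_pos > sim_neg)
    return mask


# Toy example: two classes, three-dimensional stand-in features.
pos_prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # one positive prompt per class
neg_prompts = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]   # one negative prompt per class
image_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
labels = [0, 1, 1]                                  # the third label is wrong

print(detect_clean(image_feats, labels, pos_prompts, neg_prompts))
# prints [True, True, False]
```

Only the samples flagged `True` would then be passed to the adaptation stage, which the abstract describes as fine-tuning on carefully selected clean samples.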

Paper Structure (44 sections, 8 equations, 4 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Preliminary and Initial Findings
Zero-Shot CLIP
Fine-tuning CLIP on Downstream Datasets
VPT benefits representation learning in the presence of massive noisy labels.
Textual classifier is robust to noisy labels.
FFT enhances visual recognition on clean datasets.
The Denoising Fine-tuning Framework
Identifying Noisy Labels with Dual Prompts
Optimization for Noisy Label Detector
Model Adaptation using Clean Data
Experiment
Experimental Settings
Synthetic Datasets
...and 29 more sections

Figures (4)

Figure 1: Comparison of different fine-tuning methods under (a) various ratios of noisy labels and (b) clean datasets.
Figure 2: Illustration of the proposed DeFT framework. Left: We identify noisy labels with learnable dual textual prompts and improve image-text alignment by optimizing PEFT modules. Right: Adapt pre-trained models using FFT on selected clean samples.
Figure 3: Ablation studies. We report the test accuracy across varying noise ratios for the following variants: 1) w/o adap.: DeFT without the model adaptation phase, 2) PEFT: use PEFT for model adaptation phase, and 3) FFT: use FFT for model adaptation phase.
Figure 4: Comparison of different parameter-efficient fine-tuning techniques on Tiny-ImageNet with various ratios of noisy labels.
