NLPrompt: Noise-Label Prompt Learning for Vision-Language Models

Bikang Pan, Qun Li, Xiaoying Tang, Wei Huang, Zhen Fang, Feng Liu, Jingya Wang, Jingyi Yu, Ye Shi

TL;DR

This work tackles noisy labels in prompt learning for vision-language models by introducing PromptMAE, which applies mean absolute error to prompts, and PromptOT, a text-feature–based optimal transport data purifier. The combined NLPrompt framework partitions data into clean and noisy subsets and trains clean samples with cross-entropy while noisy ones use MAE, supported by a theoretical analysis showing MAE yields lower test loss under noise. The authors validate NLPrompt through extensive experiments across synthetic and real-world noisy datasets, demonstrating state-of-the-art improvements and strong generalization to other prompt-tuning methods. The approach leverages the alignment in vision-language models and efficient OT computation to deliver robust, scalable prompt learning with practical impact in real-world noisy-label settings.
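The split-loss idea in this summary can be illustrated with a minimal sketch. It assumes that MAE is taken between the softmax output and the one-hot label (which equals 2(1 - p_label) and is therefore bounded by 2) and that a per-sample clean/noisy flag is already available. The function names `ce_loss`, `mae_loss`, and `mixed_loss` are hypothetical and not taken from the paper, and the paper's exact normalization may differ.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of class logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(logits, label):
    # Cross-entropy: -log p_label. Grows without bound as p_label -> 0,
    # so a confidently contradicted (possibly wrong) label dominates the loss.
    return -math.log(softmax(logits)[label])

def mae_loss(logits, label):
    # Assumed form: L1 distance between softmax output and one-hot label.
    # This equals 2 * (1 - p_label), so each sample contributes at most 2.
    probs = softmax(logits)
    return sum(abs(p - (1.0 if k == label else 0.0)) for k, p in enumerate(probs))

def mixed_loss(batch_logits, labels, is_clean):
    # CE on samples flagged clean, MAE on samples flagged noisy, averaged.
    total = 0.0
    for logits, y, clean in zip(batch_logits, labels, is_clean):
        total += ce_loss(logits, y) if clean else mae_loss(logits, y)
    return total / len(labels)
```

For a sample whose label strongly disagrees with the model, for example `logits = [-50.0, 50.0]` with `label = 0`, the cross-entropy is about 100 while the MAE stays at or below 2. That boundedness is one intuition for why MAE is less sensitive to mislabeled samples.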

Abstract

The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to enhance the robustness further. PromptOT employs text features in vision-language models as prototypes to construct an optimal transportation matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representations and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.
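The purification step described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the cost is assumed to be negative cosine similarity between text prototypes and image features, both marginals are assumed uniform, the transport plan is computed with plain Sinkhorn iterations, and a sample is treated as clean when its OT pseudo-label matches its given label. The names `sinkhorn`, `partition`, and the parameter `eps` (entropic regularization strength) are hypothetical.

```python
import math

def normalize(v):
    # Scale a vector to unit L2 norm.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sinkhorn(cost, eps=0.1, n_iters=200):
    # Entropic optimal transport between uniform marginals.
    # cost[k][i] is the cost of assigning sample i to class k.
    K, N = len(cost), len(cost[0])
    a = [1.0 / K] * K          # assumed uniform mass per class
    b = [1.0 / N] * N          # assumed uniform mass per sample
    G = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * K, [1.0] * N
    for _ in range(n_iters):
        u = [a[k] / sum(G[k][i] * v[i] for i in range(N)) for k in range(K)]
        v = [b[i] / sum(G[k][i] * u[k] for k in range(K)) for i in range(N)]
    return [[u[k] * G[k][i] * v[i] for i in range(N)] for k in range(K)]

def partition(image_feats, text_feats, labels, eps=0.1):
    # Text features act as class prototypes; images are matched to them by OT.
    imgs = [normalize(f) for f in image_feats]
    protos = [normalize(f) for f in text_feats]
    cost = [[-sum(p * x for p, x in zip(proto, img)) for img in imgs]
            for proto in protos]
    plan = sinkhorn(cost, eps)
    # Pseudo-label: the class receiving the most transported mass per sample.
    pseudo = [max(range(len(protos)), key=lambda k: plan[k][i])
              for i in range(len(imgs))]
    # Assumed rule: clean if the given label agrees with the pseudo-label.
    clean = [p == y for p, y in zip(pseudo, labels)]
    return pseudo, clean
```

In a toy run with two orthogonal text prototypes and four 2-D image features, a sample that lies near the first prototype but carries the second class's label would be flagged as noisy and routed to the MAE loss, while the remaining samples would be trained with cross-entropy.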

Paper Structure

This paper contains 29 sections, 5 theorems, 41 equations, 4 figures, 10 tables, and 1 algorithm.

Key Result

Lemma 4.1

At the $t$-th iteration, the learnable prompt $\mathbf{p}^{(t)}$ can be rewritten as a linear combination of the features and the prompt initialization, where $\alpha^{(t)}$ are the coefficients of the initialization, and $\beta^{(t)}$ and $\phi_{l}^{(t)}$ are the coefficients of the task-relevant and task-irrelevant features, respectively.

Figures (4)

  • Figure 1: Performance of training with MAE loss and CE loss in prompt learning on the Caltech101 dataset.
  • Figure 2: The framework of NLPrompt. The text representations initialize the prompt-based OT, which separates the dataset into clean and noisy subsets. NLPrompt combines the advantages of MAE loss and CE loss: the former is more robust on the noisy subset, while the latter performs better on the clean subset.
  • Figure 3: Performance with different numbers of shots.
  • Figure A4: Test accuracy (%) under different entropy regularization coefficients.

Theorems & Definitions (7)

  • Lemma 4.1
  • Theorem 4.2
  • Lemma E.1: Restatement of Lemma 4.1 (Feature Representation)
  • Lemma E.3
  • proof
  • Theorem E.4: Restatement of Theorem 4.2
  • proof