Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models

Fanhu Zeng, Zhen Cheng, Fei Zhu, Xu-Yao Zhang

TL;DR

This work addresses misclassification detection (MisD) in safety-critical settings, where classifiers make overconfident errors. It introduces FSMisD, a few-shot prompt-learning framework that builds on vision-language models (VLMs) to avoid training from scratch. FSMisD learns per-class category prompts, generates adaptive pseudo samples through text-guided augmentation, and adds a negative loss with orthogonal negative prompts to calibrate confidence. The method improves consistently on ImageNet and on domain-shift datasets, including natural adversarial and out-of-distribution settings, and is substantially more efficient than traditional MisD approaches that require full training. In practice, this enables scalable, robust misclassification detection on large-scale and dynamically changing datasets, with competitive performance on smaller benchmarks as well.

Abstract

Reliable prediction by classifiers is crucial for their deployment in high-security and dynamically changing situations. However, modern neural networks are often overconfident in their misclassified predictions, highlighting the need for confidence estimation to detect errors. Despite the achievements of existing methods on small-scale datasets, they all require training from scratch, and no efficient and effective misclassification detection (MisD) method exists, which hinders practical application to large-scale and ever-changing datasets. In this paper, we pave the way to exploit vision-language models (VLMs), leveraging text information, to establish an efficient and general-purpose misclassification detection framework. By harnessing the power of VLMs, we construct FSMisD, a Few-Shot prompt learning framework for MisD that avoids training from scratch and therefore improves tuning efficiency. To enhance misclassification detection ability, we use adaptive pseudo sample generation and a novel negative loss to mitigate overconfidence by pushing category prompts away from pseudo features. We conduct comprehensive experiments with prompt learning methods and validate the generalization ability across various datasets with domain shift. Significant and consistent improvements demonstrate the effectiveness, efficiency, and generalizability of our approach.

Paper Structure

This paper contains 23 sections, 8 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overall definition of misclassification detection and comparison between traditional methods and our framework.
  • Figure 2: Efficiency comparison between the traditional method and the proposed method.
  • Figure 3: Overall structure of the proposed prompt learning framework for misclassification detection. Adaptive pseudo sample generation is employed to generate pseudo samples and enhance misclassification detection by pushing category prompts away from pseudo labels. The two components are trained together in an end-to-end manner to construct a clear margin for misclassification and enhance confidence estimation.
  • Figure 4: Few-shot misclassification detection performance as the number of samples per class increases. AURC, AUROC, and FPR95 are reported.
  • Figure 5: Visualization of different cases in MisD. Compared with the previous method, our method successfully mitigates overconfidence and improves MisD performance.
  • ...and 2 more figures
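
The Figure 4 caption names three metrics: AURC, AUROC, and FPR95. As a rough illustration of what they measure, the sketch below computes each from a per-sample confidence score and a flag for whether the prediction was correct. It uses the definitions common in the MisD literature and is not taken from the paper: the function names, the tie handling, and the unscaled AURC are assumptions (some papers report AURC multiplied by 1000), and both correct and misclassified samples are assumed to be present.

```python
import numpy as np


def auroc(conf, correct):
    """Probability that a correct prediction gets higher confidence than an error."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = conf[correct], conf[~correct]
    # Compare every (correct, misclassified) pair; ties count as half a win.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def fpr_at_95_tpr(conf, correct):
    """Fraction of errors still accepted when 95% of correct predictions are accepted."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = conf[correct], conf[~correct]
    # Lowest threshold that keeps at least 95% of the correct predictions.
    k = int(np.ceil(0.95 * len(pos)))
    threshold = np.sort(pos)[::-1][k - 1]
    return float((neg >= threshold).mean())


def aurc(conf, correct):
    """Area under the risk-coverage curve (mean risk over all coverage levels)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    # Most confident samples are covered first.
    order = np.argsort(-conf, kind="stable")
    errors = (~correct[order]).astype(float)
    # Risk at coverage k/n is the error rate among the k most confident samples.
    risks = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    return float(risks.mean())


if __name__ == "__main__":
    # Toy scores: one error (0.7) is ranked above a correct prediction (0.6).
    conf = [0.9, 0.8, 0.7, 0.6, 0.5]
    correct = [True, True, False, True, False]
    print(auroc(conf, correct), fpr_at_95_tpr(conf, correct), aurc(conf, correct))
```

Higher AUROC is better, while lower FPR95 and AURC are better. An overconfident error, such as the 0.7 sample in the toy data, hurts all three at once.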