Self-Calibrated Tuning of Vision-Language Models for Out-of-Distribution Detection

Geng Yu, Jianing Zhu, Jiangchao Yao, Bo Han

TL;DR

Self-Calibrated Tuning (SCT) is a framework for effective OOD detection that uses only the given few-shot ID data. It addresses the spurious OOD features mined from ID data by adaptively directing the optimization between the two tasks (ID classification and OOD regularization) according to the prediction uncertainty of each training sample, thereby calibrating the influence of OOD regularization.

Abstract

Out-of-distribution (OOD) detection is crucial for deploying reliable machine learning models in open-world applications. Recent advances in CLIP-based OOD detection have shown promising results by regularizing prompt tuning with OOD features extracted from ID data. However, the irrelevant context mined from ID data can be spurious because the foreground-background decomposition is inaccurate, which limits OOD detection performance. In this work, we propose a novel framework, Self-Calibrated Tuning (SCT), to mitigate this problem for effective OOD detection with only the given few-shot ID data. Specifically, SCT introduces a modulating factor on each of the two components of the original learning objective. During training, it adaptively directs the optimization between the two tasks on data with different prediction uncertainty, calibrating the influence of OOD regularization. The approach is compatible with many prompt-tuning-based OOD detection methods. Extensive experiments and analyses have been conducted to characterize and demonstrate the effectiveness of the proposed SCT. The code is publicly available.
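
The abstract does not state the form of the modulating factors, so the following is only a minimal sketch under stated assumptions. It assumes the original objective is a cross-entropy term plus an entropy-maximization regularizer on local features treated as OOD, weighted by λ (as in LoCoOp-style prompt tuning), and that the two factors are 1 - p and p, where p is the predicted probability of the ground-truth class. The function name, argument shapes, and the default value of `lam` are hypothetical and chosen for illustration.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_calibrated_loss(global_logits, ood_logits, labels, lam=0.25):
    """Sketch of an uncertainty-modulated objective (assumed form).

    global_logits: (B, C) image-level logits over ID classes.
    ood_logits:    (B, R, C) logits of R local regions treated as OOD.
    labels:        (B,) integer ground-truth ID labels.
    lam:           regularization weight (hypothetical default).
    """
    probs = softmax(global_logits)
    p_true = probs[np.arange(len(labels)), labels]  # confidence on true class

    # Component 1: per-sample cross-entropy for ID classification.
    ce = -np.log(p_true + 1e-12)

    # Component 2: OOD regularization, here the negative mean entropy of the
    # region predictions (minimizing it pushes OOD regions toward uniform).
    region_probs = softmax(ood_logits)
    entropy = -(region_probs * np.log(region_probs + 1e-12)).sum(-1).mean(-1)
    ood_reg = -entropy

    # Assumed modulating factors: uncertain samples emphasize classification,
    # confident samples emphasize OOD regularization. In a real training loop
    # these weights would be treated as constants (no gradient through them).
    w_cls = 1.0 - p_true
    w_ood = p_true

    per_sample = w_cls * ce + lam * w_ood * ood_reg
    return per_sample.mean(), w_cls, w_ood
```

Under these assumptions, a sample the model already classifies confidently contributes mostly through the OOD term, while an uncertain sample, whose extracted OOD features are less trustworthy, contributes mostly through the classification term.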

Paper Structure

This paper contains 54 sections, 12 equations, 5 figures, 12 tables, 1 algorithm.

Figures (5)

  • Figure 1: Imperfect foreground-background decomposition. The top row shows the original images from ImageNet-1k and the bottom row shows the ID-irrelevant context extracted from the original images (shown as the colored patches of images on the second row), using CLIP fine-tuned with CoOp on 16-shot data. Due to the imperfect decomposition of fine-tuned vision-language models, large portions of the extracted local features from ID data belong to ID-related regions, thus harming the performance of OOD detection. More illustrations are presented in the Appendix.
  • Figure 2: Empirical demonstration of invalid OOD features extracted from ID data in LoCoOp and the influence of sample uncertainty on OOD detection performance. In the left and middle panels, we illustrate the extracted OOD features at different levels of uncertainty and find that they become unreliable as the uncertainty increases. The numbers at the bottom denote the prediction probability for the ground-truth labels from fine-tuned CLIP. In the right panel, we collect ID samples of different uncertainty levels based on prompt-tuned CLIP and divide them into 2 groups. The result demonstrates that the OOD detection performance of LoCoOp significantly degrades as the uncertainty level of ID data rises. We leave the experimental details in the Appendix for reference.
  • Figure 3: Ablation study. (a) performance of using different regularization weights $\lambda$; (b) exploration of different regularization functions for OOD regularization; (c) using different CLIP architectures; (d) comparison of different methods for extracting OOD features.
  • Figure 4: Comparison of calibration, measured by ECE, between SCT and LoCoOp trained with 1-, 2-, 4-, and 16-shot data. The evaluation is performed on the original validation set of ImageNet-1k.
  • Figure 5: Examples of the invalid OOD features extracted by CLIP. The odd-numbered rows show the original images from ImageNet-1k and the even-numbered rows show the extracted ID-irrelevant context from the corresponding images. The ground-truth labels are annotated below every even-numbered row. Although CLIP can mask out some ID-related regions (shown as the gray patches of images), large portions of the extracted OOD features (shown as the colored patches of images) obviously belong to ID features.
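
Figure 4 reports calibration as ECE (expected calibration error). For reference, a minimal sketch of the standard equal-width-bin ECE follows. The bin count of 15 is an assumption, since the figure list does not state the setting used in the paper, and the function name is illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width-bin ECE.

    confidences: max softmax probability per sample, values in (0, 1].
    correct:     1 if the prediction was right, else 0.
    n_bins:      number of bins (assumed value, not taken from the paper).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi]; a confidence of exactly 0 is never binned,
        # which cannot occur for a max softmax probability.
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in bin
    return ece
```

A lower value means the model's confidence tracks its accuracy more closely; a model that is always fully confident and always wrong reaches the maximum of 1.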