How Does Fine-Tuning Impact Out-of-Distribution Detection for Vision-Language Models?

Yifei Ming, Yixuan Li

TL;DR

This paper analyzes how parameter-efficient fine-tuning (PEFT) of vision-language models like CLIP affects out-of-distribution (OOD) detection in few-shot downstream tasks. It casts OOD detection as multi-modal concept matching between input features and in-distribution (ID) prototypes drawn from limited labeled data and label text, and introduces PEFT-MCM, a pipeline combining prompt- or adaptor-based fine-tuning with an OOD score based on maximum concept matching (MCM). Across diverse ID/OOD datasets, the study shows that PEFT improves OOD reliability relative to zero-shot CLIP, and that the MCM scoring function with an appropriate temperature $\tau$ consistently yields strong OOD separation, often outperforming the maximum similarity (MS) and maximum softmax probability (MSP) baselines. Prompt learning in particular perturbs the feature space only modestly yet improves both ID accuracy and OOD detection. Larger backbones boost performance further, and the approach remains effective even with few shots.

Abstract

Recent large vision-language models such as CLIP have shown remarkable out-of-distribution (OOD) detection and generalization performance. However, their zero-shot in-distribution (ID) accuracy is often limited for downstream datasets. Recent CLIP-based fine-tuning methods such as prompt learning have demonstrated significant improvements in ID classification and OOD generalization where OOD labels are available. Nonetheless, it remains unclear whether the model is reliable under semantic shifts without OOD labels. In this paper, we aim to bridge the gap and present a comprehensive study to understand how fine-tuning impacts OOD detection for few-shot downstream tasks. By framing OOD detection as multi-modal concept matching, we establish a connection between fine-tuning methods and various OOD scores. Our results suggest that a proper choice of OOD score is essential for CLIP-based fine-tuning. In particular, the maximum concept matching (MCM) score consistently provides a promising solution. We also show that prompt learning achieves state-of-the-art OOD detection performance, improving over the zero-shot counterpart.
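
The concept-matching scores mentioned above can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the MS score is the maximum cosine similarity between an image feature and the ID prototypes, and that the MCM score is the maximum of a temperature-scaled softmax over those similarities. The function names and the toy features are hypothetical stand-ins for real CLIP embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project feature vectors onto the unit hypersphere."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)

def cosine_similarities(image_features, prototypes):
    """Cosine similarity of each image feature to each ID prototype.

    image_features: (n_images, dim), prototypes: (n_classes, dim).
    Returns an array of shape (n_images, n_classes).
    """
    return l2_normalize(image_features) @ l2_normalize(prototypes).T

def ms_score(sims):
    """Assumed MS score: similarity to the closest ID prototype."""
    return sims.max(axis=1)

def mcm_score(sims, tau=1.0):
    """Assumed MCM score: max of a softmax over similarities scaled by 1/tau."""
    logits = sims / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.max(axis=1)

if __name__ == "__main__":
    # Toy 3-d features: two ID prototypes and two inputs (hypothetical values).
    prototypes = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    images = np.array([
        [0.9, 0.1, 0.0],   # close to the first prototype (ID-like)
        [0.5, 0.5, 0.7],   # equally far from both prototypes (OOD-like)
    ])
    sims = cosine_similarities(images, prototypes)
    print("MS :", ms_score(sims))
    print("MCM:", mcm_score(sims, tau=1.0))
```

Under both scores a higher value means the input looks more in-distribution, so an input would be flagged as OOD when its score falls below a chosen threshold. The temperature `tau` controls how sharply the softmax separates the closest prototype from the rest, and the value that works best is an empirical choice.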

Paper Structure (17 sections, 5 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: A unified pipeline for OOD detection with parameter-efficient fine-tuning of CLIP models on few-shot datasets. Given ID text labels $\mathcal{Y}_\text{in}$ and a few-shot training set, we view the textual and visual embeddings of ID classes as concept prototypes in the feature space. The OOD uncertainty of an input image can be characterized by the distance from its visual feature to the closest ID prototype from both modalities. See the method section for details.
  • Figure 2: The impact of softmax scaling. We use Stanford-Cars (ID) vs. SUN (OOD) for illustration. Applying softmax scaling significantly decreases ID-OOD separability for CoOp (top row), CoCoOp (second row), and TipAdaptorF (last row), resulting in worse OOD detection performance.
  • Figure 3: Average $S_{\text{MS}}$ for ID (Caltech-101) and OOD test sets. Prompt learning methods decrease the angular distance for ID inputs while increasing the angular distance for OOD inputs to the nearest concept prototype, leading to better ID-OOD separability (Figure 4).
  • Figure 4: Illustration of how prompt learning methods impact the hyperspherical features. Left: feature of an ID sample and its nearest ID prototype; Right: feature of an OOD sample and its nearest ID prototype.
  • Figure 5: OOD detection performance (FPR95) on ImageNet-1k (ID). Using $S_{\text{MCM}}$ score leads to significant improvement over $S_{\text{MSP}}$.
  • ...and 3 more figures
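
The FPR95 metric reported in Figure 5 can be computed as sketched below. This is a minimal illustration rather than the authors' evaluation code. It assumes the common convention that higher scores indicate ID inputs, and that FPR95 is the fraction of OOD inputs still accepted as ID when the threshold is set so that about 95% of ID inputs are accepted. The function name `fpr_at_tpr` and the toy scores are hypothetical.

```python
import numpy as np

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate on OOD data at a fixed true positive rate on ID data.

    Assumes higher scores mean "more in-distribution". The threshold is the
    (1 - tpr) quantile of the ID scores, so roughly a `tpr` fraction of ID
    inputs score at or above it.
    """
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    threshold = np.quantile(id_scores, 1.0 - tpr)
    # OOD inputs at or above the threshold are wrongly accepted as ID.
    return float(np.mean(ood_scores >= threshold))

if __name__ == "__main__":
    # Toy scores (hypothetical): ID scores spread over [0, 1], four OOD inputs.
    id_scores = np.linspace(0.0, 1.0, 101)
    ood_scores = [0.0, 0.02, 0.5, 0.9]
    print("FPR95:", fpr_at_tpr(id_scores, ood_scores))
```

Lower FPR95 is better: a value of 0 means no OOD input passes the threshold that retains 95% of ID inputs.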