How Does Fine-Tuning Impact Out-of-Distribution Detection for Vision-Language Models?
Yifei Ming, Yixuan Li
TL;DR
This paper analyzes how parameter-efficient fine-tuning (PEFT) of vision-language models like CLIP affects out-of-distribution (OOD) detection in few-shot downstream tasks. It casts OOD detection as multi-modal concept matching between input features and ID prototypes drawn from limited labeled data and label text, and introduces PEFT-MCM, a pipeline combining prompt- or adaptor-based fine-tuning with an MCM-based OOD score. Across diverse ID/OOD datasets, the study shows that PEFT improves OOD reliability relative to zero-shot CLIP, and that the MCM scoring function with an appropriate temperature $\tau$ consistently yields strong OOD separation, often outperforming MS and MSP baselines. Prompt learning, in particular, perturbs the feature space modestly but enhances both ID accuracy and OOD detection, with larger backbones further boosting performance and the approach remaining effective even with few shots.
Abstract
Recent large vision-language models such as CLIP have shown remarkable out-of-distribution (OOD) detection and generalization performance. However, their zero-shot in-distribution (ID) accuracy is often limited for downstream datasets. Recent CLIP-based fine-tuning methods such as prompt learning have demonstrated significant improvements in ID classification and OOD generalization where OOD labels are available. Nonetheless, it remains unclear whether the model is reliable under semantic shifts without OOD labels. In this paper, we aim to bridge the gap and present a comprehensive study to understand how fine-tuning impacts OOD detection for few-shot downstream tasks. By framing OOD detection as multi-modal concept matching, we establish a connection between fine-tuning methods and various OOD scores. Our results suggest that a proper choice of OOD score is essential for CLIP-based fine-tuning. In particular, the maximum concept matching (MCM) score consistently provides a promising solution. We also show that prompt learning achieves state-of-the-art OOD detection performance, improving over its zero-shot counterpart.
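To make the concept-matching view concrete, the following is a minimal sketch of an MCM-style score, assuming it is computed as the maximum softmax probability over temperature-scaled cosine similarities between an image feature and per-class text prototypes. The function names (`cosine`, `mcm_score`, `max_similarity_score`), the toy 2-D vectors, and the default `tau` are illustrative assumptions, not the authors' code; in practice the features would come from CLIP's image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mcm_score(image_feat, text_protos, tau=1.0):
    """MCM-style score: max softmax over temperature-scaled similarities.

    Assumption: higher values suggest the input matches one ID concept;
    lower values suggest OOD. `tau` is a hypothetical default.
    """
    scaled = [cosine(image_feat, t) / tau for t in text_protos]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    return max(exps) / sum(exps)

def max_similarity_score(image_feat, text_protos):
    """Baseline without softmax: the largest raw cosine similarity."""
    return max(cosine(image_feat, t) for t in text_protos)

# Toy prototypes for two ID classes (illustrative values only).
text_protos = [[1.0, 0.0], [0.0, 1.0]]
id_like = [0.9, 0.1]    # clearly closer to the first prototype
ood_like = [0.7, 0.7]   # equally close to both prototypes

id_score = mcm_score(id_like, text_protos)
ood_score = mcm_score(ood_like, text_protos)
```

In this toy setting the ambiguous input receives a uniform softmax and therefore a lower score than the input aligned with a single prototype. Flagging an input as OOD would then amount to comparing the score against a threshold chosen on held-out ID data, which is left out of the sketch.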
