BayesAdapter: enhanced uncertainty estimation in CLIP few-shot adaptation

Pablo Morales-Álvarez, Stergios Christodoulidis, Maria Vakalopoulou, Pablo Piantanida, Jose Dolz

TL;DR

BayesAdapter reframes CLIP few-shot adaptation as Bayesian inference over the adapter weights to obtain a full posterior, improving uncertainty estimation without sacrificing competitiveness in discriminative performance. By learning a variational Gaussian posterior and performing MC-based predictions, it achieves better calibration and higher reliability for high-confidence decisions across 11 datasets and two backbones. While accuracy is typically close to state-of-the-art adapters, its strength lies in uncertainty-aware prediction and selective classification, particularly as the number of shots grows. This approach enhances the safety and practicality of deploying VLM adapters in real-world tasks that require calibrated confidence and selective abstention.

Abstract

The emergence of large pre-trained vision-language models (VLMs) represents a paradigm shift in machine learning, with unprecedented results in a broad span of visual recognition tasks. CLIP, one of the most popular VLMs, has exhibited remarkable zero-shot and transfer learning capabilities in classification. To transfer CLIP to downstream tasks, adapters constitute a parameter-efficient approach that avoids backpropagation through the large model (unlike related prompt learning methods). However, CLIP adapters have been developed to target discriminative performance, and the quality of their uncertainty estimates has been overlooked. In this work we show that the discriminative performance of state-of-the-art CLIP adapters does not always correlate with their uncertainty estimation capabilities, which are essential for safe deployment in real-world scenarios. We also demonstrate that one such adapter is obtained through MAP inference from a more general probabilistic framework. Based on this observation we introduce BayesAdapter, which leverages Bayesian inference to estimate a full probability distribution instead of a single point, better capturing the variability inherent in the parameter space. In a comprehensive empirical evaluation we show that our approach obtains high-quality uncertainty estimates in the predictions, standing out in calibration and selective classification. Our code will be publicly available upon acceptance of the paper.
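
As an illustration of what "a full probability distribution instead of a single point" can mean at prediction time, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes the adapter weights are a matrix of class prototypes with a diagonal Gaussian posterior, that image features are L2-normalised, and that logits are scaled cosine similarities. The names `mc_predict`, `w_mean`, `w_log_std`, and the temperature of 100 are hypothetical choices made for this example.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mc_predict(features, w_mean, w_log_std, n_samples=10, temperature=100.0, seed=0):
    """Monte Carlo predictive distribution under a Gaussian over adapter weights.

    features  : (N, D) image embeddings, assumed L2-normalised
    w_mean    : (C, D) posterior mean of the class prototypes (assumption)
    w_log_std : (C, D) log standard deviation of a diagonal Gaussian (assumption)
    Returns an (N, C) array of class probabilities averaged over weight samples.
    """
    rng = np.random.default_rng(seed)
    probs = np.zeros((features.shape[0], w_mean.shape[0]))
    for _ in range(n_samples):
        # Reparameterised sample: W = mean + std * eps, with eps ~ N(0, I)
        eps = rng.standard_normal(w_mean.shape)
        w = w_mean + np.exp(w_log_std) * eps
        # Normalise prototypes so the logits are scaled cosine similarities
        w = w / np.linalg.norm(w, axis=1, keepdims=True)
        probs += softmax(temperature * features @ w.T)
    return probs / n_samples
```

Averaging the softmax outputs over several weight samples, rather than taking the softmax at a single weight value, is what lets disagreement between plausible weights lower the reported confidence. With a very small standard deviation the function reduces to the usual point-estimate prediction.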

Paper Structure

This paper contains 18 sections, 2 theorems, 14 equations, 5 figures, 23 tables, 1 algorithm.

Key Result

Proposition 1

Given training data $({\mathbf X},{\mathbf Y})$, maximizing the (log) posterior probability ${\mathrm{p}}({\mathbf W}|{\mathbf X},{\mathbf Y})$ for the model in eqs. \ref{eq:prob_model_prior}-\ref{eq:prob_model_lik} is equivalent to minimizing the loss in eq. \ref{eq:CLAP_obj}.
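
The proposition follows the standard MAP argument, sketched below in generic form. The specific prior and likelihood are those of the referenced equations, which are not reproduced on this page, so the derivation only assumes that the prior on ${\mathbf W}$ does not depend on ${\mathbf X}$.

```latex
\begin{align}
\hat{\mathbf W}_{\mathrm{MAP}}
  &= \arg\max_{\mathbf W} \, \log \mathrm{p}(\mathbf W \mid \mathbf X, \mathbf Y) \\
  &= \arg\max_{\mathbf W} \left[ \log \mathrm{p}(\mathbf Y \mid \mathbf X, \mathbf W)
     + \log \mathrm{p}(\mathbf W) - \log \mathrm{p}(\mathbf Y \mid \mathbf X) \right] \\
  &= \arg\min_{\mathbf W} \left[ -\log \mathrm{p}(\mathbf Y \mid \mathbf X, \mathbf W)
     - \log \mathrm{p}(\mathbf W) \right]
\end{align}
```

The evidence term $\log \mathrm{p}({\mathbf Y}|{\mathbf X})$ drops out because it does not depend on ${\mathbf W}$. For a categorical likelihood the first remaining term is a cross-entropy loss. As a purely illustrative assumption, a Gaussian prior with mean $\boldsymbol{\mu}$ and precision $\lambda$ would turn the second term into the quadratic penalty $\frac{\lambda}{2}\|{\mathbf W}-\boldsymbol{\mu}\|^2$ plus a constant, which shows how a prior acts as a regularizer in the minimized loss.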

Figures (5)

  • Figure 1: Graphical representation of the novel BayesAdapter. Whereas CLAP estimates ${\mathbf W}$ through a single point estimate, BayesAdapter estimates a probability distribution over it.
  • Figure 2: Calibration plots for the four best methods in terms of ECE in Table \ref{tab:acc_cal_all} (ResNet50 backbone). A full figure with all the methods is shown in the Appendix, Sec. \ref{sec:app_other_tab_figs}. In each case, the lower subplot depicts the accuracy and average confidence for the samples in each of the ten bins (confidence scores from 0% to 100% in steps of 10%). Ideally, the gap between them should be zero. The upper plot shows the proportion of samples in each bin, along with the average confidence and accuracy over the whole test set.
  • Figure 3: Visualizing the over-conservative behavior of some baselines, including the most recent ones, CLAP and LP++. We show the histogram of the confidence score on the test samples (after adapting on few-shot training data). Whereas CLAP, LP++ and TipA abstain from making predictions at 99% confidence, BayesAdapter covers 10.53% of the test set, achieving accuracy above 99%. TipA-f-, CrossModal and TaskRes also make reliable predictions at 99% confidence, achieving coverage of 0.26%, 7.16% and 0.59%, respectively.
  • Figure 4: Evolution of metrics with shots. From left to right: accuracy, calibration, and coverage at 99% confidence level. In the last one, the marker is missing if the method is not reliable at 99% confidence level. Whereas methods are similar in accuracy, more differences appear when evaluating the uncertainty estimates via calibration and selective classification. More details in the text.
  • Figure 5: Calibration plots for all the compared methods (ResNet50 backbone). In each case, the lower subplot depicts the accuracy and average confidence for the samples in each of the ten bins (confidence scores from 0% to 100% in steps of 10%). Ideally, the gap between them should be zero. The upper plot shows the proportion of samples in each bin, along with the average confidence and accuracy over the whole test set.
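
The two quantities behind these figures, the per-bin gap between accuracy and confidence and the coverage at a confidence level, can be computed as in the sketch below. This is a generic formulation of expected calibration error (ECE) with equal-width bins and of threshold-based selective classification. The function names and the exact binning convention are assumptions made for this example and may differ from the paper's evaluation code.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins (ten bins of width 10% by default).

    probs  : (N, C) predicted class probabilities
    labels : (N,) integer ground-truth labels
    """
    conf = probs.max(axis=1)            # confidence score of each prediction
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # Gap between accuracy and average confidence, weighted by bin mass
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

def selective_classification(probs, labels, threshold=0.99):
    """Coverage and accuracy when predicting only if confidence >= threshold."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    keep = conf >= threshold
    coverage = keep.mean()              # fraction of samples not abstained on
    accuracy = (pred[keep] == labels[keep]).mean() if keep.any() else float("nan")
    return coverage, accuracy
```

Under this reading, a method is reliable at the 99% level when the accuracy on its retained samples is at least 99%, and a method that never reaches 99% confidence has zero coverage, which corresponds to the abstaining behavior described in Figure 3.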

Theorems & Definitions (3)

  • Proposition 1
  • Proposition
  • Claim