Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models

Zhenxiang Lin, Maryam Haghighat, Will Browne, Dimity Miller

TL;DR

Vision-language models often produce overconfident errors, which undermines safety-critical use. The authors propose ICPE, a training-free post-hoc uncertainty method that builds a dictionary of per-class Gaussian embeddings in a PCA-reduced visual space to capture intra-class feature distributions, and combines these intra-class likelihoods with inter-modal similarities to detect errors. Across five datasets and multiple backbones, ICPE achieves state-of-the-art error detection, demonstrates robustness to distribution shift and low data regimes, and highlights the importance of PCA-based stabilization of covariances. The work also discusses limitations under severe distribution shifts and edge-device constraints, suggesting directions for future improvement.

Abstract

Vision-language models (VLMs), such as CLIP, have gained popularity for their strong open vocabulary classification performance, but they are prone to assigning high confidence scores to misclassifications, limiting their reliability in safety-critical applications. We introduce a training-free, post-hoc uncertainty estimation method for contrastive VLMs that can be used to detect erroneous predictions. The key to our approach is to measure visual feature consistency within a class, using feature projection combined with multivariate Gaussians to create class-specific probabilistic embeddings. Our method is VLM-agnostic, requires no fine-tuning, demonstrates robustness to distribution shift, and works effectively with as few as 10 training images per class. Extensive experiments on ImageNet, Flowers102, Food101, EuroSAT and DTD show state-of-the-art error detection performance, significantly outperforming both deterministic and probabilistic VLM baselines. Code is available at https://github.com/zhenxianglin/ICPE.
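
To make the intra-class idea concrete, the following is a minimal sketch, not the authors' released implementation. It assumes image features are already extracted as NumPy arrays, fits PCA with a plain SVD, stores one Gaussian per class in the reduced space, and scores a test feature by its squared Mahalanobis distance to the Gaussian of its predicted class. The function names, the `eps` regulariser, the synthetic data, and the choice of 4 components are all illustrative assumptions. How the paper combines this term with the image-text similarity is not specified in this summary, so the sketch stops at the intra-class term.

```python
import numpy as np

def fit_pca(features, n_components):
    """Return (mean, components) of a PCA fitted with an SVD."""
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def build_class_dictionary(features, labels, mean, components, eps=1e-6):
    """Fit one Gaussian (mean, inverse covariance) per class in PCA space."""
    projected = (features - mean) @ components.T
    dictionary = {}
    for c in np.unique(labels):
        z = projected[labels == c]
        mu = z.mean(axis=0)
        # eps * I is an assumed regulariser that keeps the covariance invertible.
        cov = np.cov(z, rowvar=False) + eps * np.eye(z.shape[1])
        dictionary[int(c)] = (mu, np.linalg.inv(cov))
    return dictionary

def intra_class_distance(feature, predicted_class, mean, components, dictionary):
    """Squared Mahalanobis distance to the predicted class's Gaussian."""
    z = (feature - mean) @ components.T
    mu, cov_inv = dictionary[predicted_class]
    diff = z - mu
    return float(diff @ cov_inv @ diff)

# Synthetic stand-in for VLM image features: 2 classes, 10 images each.
rng = np.random.default_rng(0)
dim, per_class = 32, 10
centers = np.zeros((2, dim))
centers[0, 0], centers[1, 0] = 5.0, -5.0
features = np.concatenate([c + rng.normal(size=(per_class, dim)) for c in centers])
labels = np.repeat([0, 1], per_class)

mean, components = fit_pca(features, n_components=4)
dictionary = build_class_dictionary(features, labels, mean, components)

query = centers[0]  # a feature that truly belongs to class 0
d_own = intra_class_distance(query, 0, mean, components, dictionary)
d_other = intra_class_distance(query, 1, mean, components, dictionary)
print(f"distance if predicted as class 0: {d_own:.2f}")
print(f"distance if predicted as class 1: {d_other:.2f}")
```

A larger distance means the test feature sits further from the feature distribution of the predicted class, which is the signal the abstract describes using to flag likely misclassifications.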

Paper Structure

This paper contains 19 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustration of different uncertainty estimation paradigms for vision-language models (VLMs). (a) The standard image–text cosine similarity approach assigns high confidence to the Alaskan image (green star) due to its high similarity with the "Husky" text embedding, despite being a misclassification. (b) Image–image similarity methods estimate uncertainty based on proximity to the class mean (yellow circle), but ignore the feature distribution, leading to unreliable scores. (c) Our method models intra-class distributions using image features and assigns higher uncertainty when a test sample deviates from the class distribution, enabling more accurate uncertainty estimation.
  • Figure 2: Overview of the proposed retrieval-augmented uncertainty estimation pipeline, which utilises a dictionary of intra-class probabilistic distributions to estimate uncertainty.
  • Figure 3: Visualization of our uncertainty estimates on ImageNet (Deng et al., 2009). Green indicates that our method correctly separated a correct prediction from an error, and red indicates that it did not. The uncertainty threshold for error rejection is 0.5.
  • Figure 4: Testing on ImageNet with a CLIP ViT-B/16, our method achieves SOTA with 10 labeled images per class.
  • Figure 5: For our probabilistic embeddings, application of PCA mitigates ill-conditioned covariance matrices, measured by the log condition number of the covariance matrices for each class.
  • ...and 1 more figure
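
The effect described in the Figure 5 caption can be reproduced in spirit with a small experiment. The sketch below uses synthetic data, not the paper's features, and assumes 5 classes, 10 samples per class, 64-dimensional features, and 4 PCA components, all chosen only for illustration. With fewer samples than dimensions, a class covariance is rank-deficient and its condition number is enormous or infinite. After projection onto a few principal components, the same class yields a well-conditioned covariance.

```python
import numpy as np

def log10_condition_number(x):
    """log10 condition number of the sample covariance of x (rows are samples)."""
    return float(np.log10(np.linalg.cond(np.cov(x, rowvar=False))))

rng = np.random.default_rng(0)
n_classes, per_class, dim, n_components = 5, 10, 64, 4

# Synthetic features: each class is an isotropic cloud around its own mean.
class_means = rng.normal(scale=3.0, size=(n_classes, dim))
features = np.concatenate(
    [m + rng.normal(size=(per_class, dim)) for m in class_means]
)

# PCA fitted on the pooled features via SVD.
global_mean = features.mean(axis=0)
_, _, vt = np.linalg.svd(features - global_mean, full_matrices=False)
components = vt[:n_components]

# Compare one class before and after projection.
one_class = features[:per_class]
raw = log10_condition_number(one_class)
reduced = log10_condition_number((one_class - global_mean) @ components.T)
print(f"log10 condition number, raw 64-d: {raw:.1f}")
print(f"log10 condition number, after PCA to 4-d: {reduced:.1f}")
```

A lower log condition number means the covariance can be inverted stably, which matters for any Gaussian likelihood or Mahalanobis computation built on it.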