Are foundation models for computer vision good conformal predictors?

Leo Fillioux, Julio Silva-Rodríguez, Ismail Ben Ayed, Paul-Henry Cournède, Maria Vakalopoulou, Stergios Christodoulidis, Jose Dolz

TL;DR

The paper investigates whether vision foundation models provide reliable uncertainty quantification under Conformal Prediction (CP), a framework that offers finite-sample marginal coverage guarantees. It benchmarks 11 foundation models (including DINO, DINOv2, VICReg, CLIP, and MetaCLIP) on CIFAR-10, CIFAR-100, and ImageNet variants using three CP methods (APS, RAPS, and LAC), and analyzes the effect of confidence calibration and few-shot adaptation. The key findings are that ViT-based models conformalize well, that calibration often reduces CP efficiency, that APS offers the most robust marginal and conditional coverage under distribution shifts, and that few-shot adaptation further improves conformal scores in-distribution. For practical deployment of uncertainty-aware vision systems, these results recommend APS for robust coverage, at some cost in efficiency, and highlight the value of few-shot adaptation of CLIP-style models for improved conformalization.

Abstract

Recent advances in self-supervision and contrastive learning have brought the performance of foundation models to unprecedented levels in a variety of tasks. Fueled by this progress, these models are becoming the prevailing approach for a wide array of real-world vision problems, including risk-sensitive and high-stakes applications. However, ensuring safe deployment in these scenarios requires a more comprehensive understanding of their uncertainty modeling capabilities, which has been barely explored. In this work, we delve into the behaviour of vision and vision-language foundation models under Conformal Prediction (CP), a statistical framework that provides theoretical guarantees of marginal coverage of the true class. Across extensive experiments including popular vision classification benchmarks, well-known foundation vision models, and three CP methods, our findings reveal that foundation models are well-suited for conformalization procedures, particularly those integrating Vision Transformers. We also show that calibrating the confidence predictions of these models, a popular strategy to improve their uncertainty quantification, actually leads to efficiency degradation of the conformal set on adaptive CP methods. Furthermore, few-shot adaptation of Vision-Language Models (VLMs) to downstream tasks, whose popularity is surging, enhances conformal scores compared to zero-shot predictions. Last, our empirical study exposes APS as particularly promising in the context of vision foundation models, as it does not violate the marginal coverage guarantees across multiple challenging, yet realistic scenarios.
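
As an illustration of the procedure the abstract refers to, the following is a minimal sketch of split conformal prediction with the LAC and APS scores, written in NumPy. It is not the authors' implementation: the function names, the synthetic softmax outputs in `demo`, and the use of the non-randomized APS variant are assumptions made for this example, and RAPS (which adds a regularization term to the APS score) is omitted.

```python
import numpy as np


def lac_scores(probs, labels):
    """LAC non-conformity score: one minus the probability of the true class."""
    return 1.0 - probs[np.arange(len(labels)), labels]


def aps_scores(probs, labels):
    """APS score (non-randomized variant, an assumption of this sketch):
    cumulative probability of all classes ranked at or above the true class."""
    order = np.argsort(-probs, axis=1)  # classes by descending probability
    cumsum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    rank = np.argmax(order == labels[:, None], axis=1)  # position of the true class
    return cumsum[np.arange(len(labels)), rank]


def conformal_threshold(cal_scores, alpha):
    """Order statistic of rank ceil((n + 1)(1 - alpha)) among n calibration scores."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # 1-based rank
    if k > n:  # too few calibration points: every class must be included
        return np.inf
    return np.sort(cal_scores)[k - 1]


def lac_sets(probs, qhat):
    """Boolean (n, K) mask: class k is kept when 1 - p_k <= qhat."""
    return (1.0 - probs) <= qhat


def aps_sets(probs, qhat):
    """Boolean (n, K) mask: class k is kept when its APS score is <= qhat.
    The set can be empty if the top probability alone exceeds qhat."""
    order = np.argsort(-probs, axis=1)
    cumsum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    keep_sorted = cumsum <= qhat
    mask = np.zeros_like(keep_sorted)
    np.put_along_axis(mask, order, keep_sorted, axis=1)  # undo the sorting
    return mask


def demo(seed=0, n=4000, num_classes=10, alpha=0.1):
    """Run both methods on synthetic data (hypothetical stand-in for model outputs)."""
    rng = np.random.default_rng(seed)
    logits = 2.0 * rng.normal(size=(n, num_classes))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Labels are drawn from the predicted distribution, so calibration and
    # test points are exchangeable, which is what the guarantee requires.
    labels = np.array([rng.choice(num_classes, p=p) for p in probs])
    half = n // 2
    cal_p, cal_y = probs[:half], labels[:half]
    test_p, test_y = probs[half:], labels[half:]
    results = {}
    for name, score_fn, set_fn in [("LAC", lac_scores, lac_sets),
                                   ("APS", aps_scores, aps_sets)]:
        qhat = conformal_threshold(score_fn(cal_p, cal_y), alpha)
        sets = set_fn(test_p, qhat)
        coverage = sets[np.arange(len(test_y)), test_y].mean()
        results[name] = (float(coverage), float(sets.sum(axis=1).mean()))
    return results


if __name__ == "__main__":
    for method, (coverage, size) in demo().items():
        print(f"{method}: coverage={coverage:.3f}, mean set size={size:.2f}")
```

On exchangeable calibration and test data, both methods are expected to reach empirical coverage close to or above $1-\alpha$; they differ mainly in how large the resulting prediction sets are.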

Paper Structure

This paper contains 22 sections, 10 equations, 20 figures, 11 tables.

Figures (20)

  • Figure 1: Relationship between the linear probing model accuracy and conformal set size (top) and the coverage gap (bottom) across different tasks of increasing complexity. From left to right: CIFAR-10, CIFAR-100 and ImageNet.
  • Figure 2: Comparison (APS vs RAPS) of the class-conditional coverage and set size for the class for which RAPS has the worst class-conditional coverage. Experiments performed on CIFAR-100. Models sorted (in ascending order) by their LP performance ($\min=0.65$ and $\max=0.92$), indicated by the size of the circles.
  • Figure 3: $\text{ViT}_\text{ImageNet}$ vs $\text{ViT}_\text{CLIP}$. Analyzing the difference in set size between $\text{ViT}_\text{CLIP}$ and $\text{ViT}_\text{ImageNet}$. Equal set sizes not shown.
  • Figure 4: Evaluation under domain-shift. Set size ($\downarrow$), coverage ($\uparrow$), and MCCC ($\uparrow$) across three CP methods and three foundation models. ImageNet versions are sorted based on OOD performance in clap24.
  • Figure 5: Domain shift analysis. Distribution of class-conditional coverages for CLIP on ImageNet-A: APS (left) and RAPS (right).
  • ...and 15 more figures
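
The evaluation quantities named in the captions above (set size, coverage, class-conditional coverage) can be computed from boolean prediction-set masks roughly as in the following sketch. The function name `evaluate_sets` is hypothetical, and reading MCCC as the minimum class-conditional coverage is an assumption here, since this listing does not expand the acronym.

```python
import numpy as np


def evaluate_sets(pred_sets, labels, num_classes):
    """Summarize conformal prediction sets.

    pred_sets: boolean array of shape (n, K), True where a class is in the set.
    labels:    integer array of shape (n,) with the true classes.
    """
    covered = pred_sets[np.arange(len(labels)), labels]  # is the true class in the set?
    class_cov = np.full(num_classes, np.nan)  # NaN for classes absent from the data
    for k in range(num_classes):
        in_class = labels == k
        if in_class.any():
            class_cov[k] = covered[in_class].mean()
    return {
        "coverage": float(covered.mean()),                     # marginal coverage
        "mean_set_size": float(pred_sets.sum(axis=1).mean()),  # efficiency
        "class_coverage": class_cov,                           # per-class coverage
        # Assumed reading of MCCC: worst coverage over the observed classes.
        "min_class_coverage": float(np.nanmin(class_cov)),
    }
```

A method can satisfy the marginal coverage target while still leaving individual classes badly covered, which is why the per-class values are reported alongside the overall coverage.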