Are foundation models for computer vision good conformal predictors?

Leo Fillioux; Julio Silva-Rodríguez; Ismail Ben Ayed; Paul-Henry Cournède; Maria Vakalopoulou; Stergios Christodoulidis; Jose Dolz

TL;DR

The paper investigates whether vision and vision-language foundation models provide reliable uncertainty quantification under Conformal Prediction (CP), a framework that guarantees finite-sample marginal coverage of the true class. It benchmarks 11 foundation models (including DINO, DINOv2, VICReg, CLIP, and MetaCLIP) on CIFAR-10/100 and ImageNet variants with three CP methods (APS, RAPS, and LAC), and analyzes the effect of confidence calibration and few-shot adaptation. Key findings: ViT-based models conformalize well; confidence calibration tends to reduce the efficiency of adaptive CP methods (i.e., it yields larger prediction sets); APS gives the most robust marginal and conditional coverage under distribution shift; and few-shot adaptation of vision-language models improves conformal scores in-distribution over zero-shot prediction. For practical deployment, the results recommend APS when robust coverage matters, at some cost in efficiency, and point to few-shot adaptation of CLIP-style models as a way to improve conformalization.

Abstract

Recent advances in self-supervision and contrastive learning have brought the performance of foundation models to unprecedented levels in a variety of tasks. Fueled by this progress, these models are becoming the prevailing approach for a wide array of real-world vision problems, including risk-sensitive and high-stakes applications. However, ensuring safe deployment in these scenarios requires a more comprehensive understanding of their uncertainty modeling capabilities, which has been barely explored. In this work, we delve into the behaviour of vision and vision-language foundation models under Conformal Prediction (CP), a statistical framework that provides theoretical guarantees of marginal coverage of the true class. Across extensive experiments including popular vision classification benchmarks, well-known foundation vision models, and three CP methods, our findings reveal that foundation models are well-suited for conformalization procedures, particularly those integrating Vision Transformers. We also show that calibrating the confidence predictions of these models, a popular strategy to improve their uncertainty quantification, actually leads to efficiency degradation of the conformal set on adaptive CP methods. Furthermore, few-shot adaptation of Vision-Language Models (VLMs) to downstream tasks, whose popularity is surging, enhances conformal scores compared to zero-shot predictions. Last, our empirical study exposes APS as particularly promising in the context of vision foundation models, as it does not violate the marginal coverage guarantees across multiple challenging, yet realistic scenarios.
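To make the conformalization procedure concrete, here is a minimal sketch of split conformal prediction with two of the three scores named above, LAC and APS. It is an illustration under stated assumptions, not the authors' implementation: synthetic softmax outputs stand in for a foundation model's predictions, all function names are hypothetical, APS is shown in its non-randomized form, and RAPS is omitted.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def lac_all_scores(probs):
    """LAC non-conformity score for every class: 1 - softmax probability."""
    return 1.0 - probs

def aps_all_scores(probs):
    """Non-randomized APS score for every class: cumulative probability mass
    of all classes ranked at or above it (the class itself included)."""
    order = np.argsort(-probs, axis=1)  # classes sorted by descending probability
    cumsum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    scores = np.empty_like(cumsum)
    np.put_along_axis(scores, order, cumsum, axis=1)  # undo the sort
    return scores

def conformalize(score_fn, cal_probs, cal_labels, test_probs, alpha):
    """Calibrate a threshold on held-out data, then build boolean prediction sets."""
    cal_scores = score_fn(cal_probs)[np.arange(len(cal_labels)), cal_labels]
    qhat = conformal_quantile(cal_scores, alpha)
    return score_fn(test_probs) <= qhat  # shape (n_test, n_classes)

# Synthetic stand-in for model outputs (assumption: any softmax classifier works).
rng = np.random.default_rng(0)
n_cal, n_test, K, alpha = 2000, 2000, 10, 0.1

def fake_model_outputs(n):
    labels = rng.integers(0, K, size=n)
    logits = rng.normal(size=(n, K))
    logits[np.arange(n), labels] += 2.0  # make the true class more likely
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True), labels

cal_probs, cal_labels = fake_model_outputs(n_cal)
test_probs, test_labels = fake_model_outputs(n_test)

lac_set = conformalize(lac_all_scores, cal_probs, cal_labels, test_probs, alpha)
aps_set = conformalize(aps_all_scores, cal_probs, cal_labels, test_probs, alpha)

# Empirical marginal coverage and efficiency (average set size).
lac_cov = lac_set[np.arange(n_test), test_labels].mean()
aps_cov = aps_set[np.arange(n_test), test_labels].mean()
lac_size = lac_set.sum(axis=1).mean()
aps_size = aps_set.sum(axis=1).mean()
print(f"LAC: coverage={lac_cov:.3f}, mean set size={lac_size:.2f}")
print(f"APS: coverage={aps_cov:.3f}, mean set size={aps_size:.2f}")
```

Coverage here is the fraction of test samples whose true class falls in the prediction set, and efficiency is the average set size. These are the two quantities that the coverage-versus-efficiency trade-off refers to.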
