Toward a Holistic Evaluation of Robustness in CLIP Models

Weijie Tu, Weijian Deng, Tom Gedeon

TL;DR

This paper presents a holistic robustness evaluation of CLIP models that extends beyond accuracy to cover visual-factor robustness, out-of-distribution (OOD) detection, calibration, zero-shot retrieval, 3D awareness, and vision–language encoder interactions. Using a large-scale, multi-faceted experimental design, it analyzes 84 zero-shot CLIP models, 44 ImageNet-fine-tuned CLIP models, 127 ImageNet baselines, and LLaVA variants across diverse data sources and evaluation benchmarks. Key findings include the strong factor-level robustness of CLIP relative to baselines, a shape bias in zero-shot CLIP that diminishes after fine-tuning, and pronounced effects of training distribution, fine-tuning strategy, and test-time prompts on safety-related objectives. The work offers practical guidance for designing robust vision-language models and highlights the need for multi-dimensional evaluation to ensure reliability in real-world deployment.

Abstract

Contrastive Language-Image Pre-training (CLIP) models have shown significant potential, particularly in zero-shot classification across diverse distribution shifts. Building on existing evaluations of overall classification robustness, this work aims to provide a more comprehensive assessment of CLIP by introducing several new perspectives. First, we investigate the robustness of CLIP models to variations in specific visual factors. Second, we assess two critical safety objectives, confidence uncertainty and out-of-distribution detection, beyond mere classification accuracy. Third, we evaluate the finesse with which CLIP models bridge the image and text modalities. Fourth, we extend our examination to 3D awareness in CLIP models, moving beyond traditional 2D image understanding. Finally, we explore the interaction between vision and language encoders within modern large multimodal models (LMMs) that use CLIP as the visual backbone, focusing on how this interaction affects classification robustness. In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts. Our study uncovers several previously unknown insights into CLIP. For instance, the architecture of the visual encoder in CLIP plays a significant role in its robustness against 3D corruption. CLIP models tend to exhibit a bias towards shape when making predictions, and this bias tends to diminish after fine-tuning on ImageNet. Vision-language models like LLaVA, which leverage the CLIP vision encoder, can outperform CLIP alone in classification on challenging categories. Our findings offer guidance for enhancing the robustness and reliability of CLIP models.

Paper Structure

This paper contains 24 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Performance of models on the subset of ImageNet-X annotated with a given visual factor (y-axis) versus their overall accuracy on the whole of ImageNet-X (x-axis). Each point represents a model. Both axes are probit-transformed following Taori et al. (2020). The black dashed line represents an ideally robust model whose performance on each visual factor equals its overall performance. The blue lines are fit with robust linear regression (Huber, 2011). We include models supervised on ImageNet-1K, models pre-trained on more data, contrastive learning models, CLIP models trained on two data distributions, and their fine-tuned counterparts.
  • Figure 2: Shape bias analysis of CLIP, fine-tuned CLIP (CLIP-FT), models pre-trained on more data (Pretrain), and standard models. Larger points denote larger models within each group. We observe that CLIP models are more shape-biased.
  • Figure 3: The influence of input resolution on shape bias when fine-tuning CLIP. We also report accuracy on the ImageNet validation set (IN-Val) and Stylized ImageNet (SIN). The higher value in each model pair is in bold. With the same backbone architecture, the CLIP model fine-tuned with a larger input resolution is more accurate on IN-Val but less shape-biased and less accurate on SIN.
  • Figure 4: OOD detection capability of models versus classification accuracy on the in-distribution (ID) dataset. OOD detection ability is measured by AUROC ($\uparrow$) and FPR@95 ($\downarrow$). Each point represents a model. We plot the results on iNaturalist, PLACES, and NINCO. The blue lines are fit with robust linear regression (Huber, 2011). We report Spearman's rank correlation and $R^2$ to quantify the strength of the correlation between ID accuracy and OOD detection performance for zero-shot CLIP trained on WIT and LAION. Both axes are probit-transformed following Taori et al. (2020). We observe that training distribution has a greater impact than training set size on the OOD detection performance of CLIP. Moreover, CLIP models additionally fine-tuned on ImageNet-12K are generally better at detecting OOD samples than those fine-tuned only on ImageNet-1K.
  • Figure 5: Model calibration performance with respect to classification accuracy. We report results on the in-distribution test set, ImageNet-V2-A, ImageNet-R, and ImageNet-A. Two metrics are considered: ECE ($\downarrow$) and NLL ($\downarrow$). We also include calibration performance after temperature scaling. Each point represents a model. Colors represent model groups. For zero-shot CLIP, shapes additionally indicate training distribution and training set size. CLIP models can have higher ECE than standard models, and training distribution and training set size are the key factors influencing their calibration performance. Moreover, after temperature scaling is applied to both CLIP and other models, CLIP models follow a consistent trend that is distinct from the others and show better calibration performance.
  • ...and 8 more figures
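
The captions above refer to a probit axis transform (Figures 1 and 4), FPR@95 (Figure 4), and ECE (Figure 5). The following is a minimal sketch of how these quantities are commonly computed, not the authors' evaluation code. The function names, the 10 equal-width bins for ECE, and the convention that a higher score means "more in-distribution" are assumptions made for illustration.

```python
from statistics import NormalDist


def probit(p):
    """Map a rate in (0, 1), e.g. an accuracy, to the inverse standard-normal CDF scale."""
    return NormalDist().inv_cdf(p)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (assumed binning scheme):
    the sample-weighted mean of |accuracy - mean confidence| over non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


def fpr_at_95_tpr(id_scores, ood_scores):
    """False-positive rate on OOD samples at the score threshold that retains
    at least 95% of ID samples (assumes higher score = more in-distribution)."""
    ranked = sorted(id_scores, reverse=True)
    keep = (95 * len(ranked) + 99) // 100  # integer ceil of 0.95 * n
    threshold = ranked[keep - 1]
    return sum(s >= threshold for s in ood_scores) / len(ood_scores)
```

As a usage example, `expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])` places the samples in two bins, each with a 0.1 gap between mean confidence and accuracy. Lower values are better for both ECE and FPR@95, matching the $\downarrow$ markers in the captions.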