A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)

Weijie Tu, Weijian Deng, Tom Gedeon

TL;DR

This work systematically evaluates CLIP's robustness beyond standard accuracy by examining resilience to visual factors, out-of-distribution (OOD) detection, and predictive uncertainty across a large suite of models and training conditions. By analyzing 83 CLIP models and 127 ImageNet baselines under 10 visual factors, 5 OOD scenarios, and 8 test conditions, the study shows that CLIP is often more robust at the factor level but is not consistently better calibrated than non-CLIP models. It further finds that the training source and fine-tuning strategy strongly shape safety-related properties, and that temperature scaling can improve calibration, with the calibration benefit obtained on in-distribution (ID) data even transferring to OOD data. These findings highlight the importance of training-data design for building more robust and reliable CLIP-based systems in real-world settings.

Abstract

Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable generalization capabilities across multiple challenging distribution shifts. However, their robustness to variations in specific visual factors remains largely unexplored. In real-world applications, reliable and safe systems must consider safety objectives beyond classification accuracy, such as predictive uncertainty. Yet the effectiveness of CLIP models on such safety-related objectives is less explored. Motivated by this, this work comprehensively investigates the safety objectives of CLIP models, focusing on three key properties: resilience to visual factor variations, calibrated uncertainty estimates, and the ability to detect anomalous inputs. To this end, we study 83 CLIP models and 127 ImageNet classifiers, which are diverse in architecture, (pre-)training distribution, and training strategy. We consider 10 visual factors (e.g., shape and pattern), 5 types of out-of-distribution data, and 8 natural and challenging test conditions with different shift types, such as texture, style, and perturbation shifts. Our study unveils several previously unknown insights into CLIP models. For instance, they are not consistently more calibrated than other ImageNet models, which contradicts existing findings. Additionally, our analysis underscores the significance of training source design by showing its profound influence on the three safety-related properties. We believe our comprehensive study can shed light on and help guide the development of more robust and reliable CLIP models.
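
As a rough illustration of what "calibrated uncertainty" and temperature scaling mean in this setting, the following standard-library Python sketch computes expected calibration error (ECE) and negative log-likelihood (NLL) and fits a single temperature on held-out data. It is a sketch under stated assumptions, not the authors' evaluation code: the grid search over temperatures, the 10 equal-width confidence bins, and the toy logits are all choices made here for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of one logit vector at a given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits_batch, labels, temperature=1.0):
    """Mean negative log-likelihood of the true labels."""
    losses = [-math.log(softmax(z, temperature)[y])
              for z, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)

def ece(logits_batch, labels, temperature=1.0, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    # Per bin: [sample count, summed confidence, summed correctness].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for z, y in zip(logits_batch, labels):
        probs = softmax(z, temperature)
        conf = max(probs)
        pred = probs.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += float(pred == y)
    n = len(labels)
    # Sum over bins of (bin size / n) * |accuracy - mean confidence|.
    return sum(abs(s_conf - s_acc) for cnt, s_conf, s_acc in bins if cnt) / n

def fit_temperature(logits_batch, labels, grid=None):
    """Pick the temperature minimizing held-out NLL (grid search is an assumption)."""
    grid = grid or [0.05 * k for k in range(1, 201)]  # 0.05 ... 10.0
    return min(grid, key=lambda t: nll(logits_batch, labels, t))

# Toy, made-up logits: an overconfident 3-class model that is right 3 times out of 4.
val_logits = [[8.0, 0.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]

T_star = fit_temperature(val_logits, val_labels)
print("fitted temperature:", T_star)
print("ECE before / after:", ece(val_logits, val_labels), ece(val_logits, val_labels, T_star))
```

Dividing the logits by a temperature above 1 softens the predicted probabilities without changing the predicted class, so accuracy is untouched while confidence moves closer to the observed accuracy.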

Paper Structure (15 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Model performance on the subset of ImageNet-X annotated with a given visual factor (y-axis) versus overall accuracy on the whole of ImageNet-X (x-axis). Each point represents a model. Both axes are probit-transformed following Taori et al. (2020). The black dashed line marks an ideally robust model, whose performance on each visual factor equals its overall performance. The blue lines are fit with robust linear regression (Huber, 2011). We include models trained with supervision on ImageNet-1K, models pre-trained on more data, contrastive learning models, CLIP models trained on two data distributions, and their fine-tuned counterparts. We find that CLIP models are generally more robust on six of the ten factors but less robust to Pose than the other model groups.
  • Figure 2: Shape-bias analysis of CLIP, fine-tuned CLIP (CLIP-FT), models pre-trained on more data (Pretrain), and standard models. Larger points denote larger models within each group. We observe that CLIP models are more shape-biased.
  • Figure 3: OOD detection ability of models versus classification accuracy on the ID dataset. OOD detection is measured by AUROC ($\uparrow$) and FPR@95 ($\downarrow$). Each point represents a model. We plot the results on iNaturalist, SUN, PLACES, TEXTURE, and ImageNet-O. The blue lines are fit with robust linear regression (Huber, 2011). We observe that the training distribution has a greater impact than the quantity of training data on the OOD detection performance of CLIP. Moreover, CLIP models that are additionally fine-tuned on ImageNet-12K are generally better at detecting OOD samples than those fine-tuned directly on ImageNet-1K.
  • Figure 4: Model calibration versus classification accuracy. We report results on the in-distribution test set, ImageNet-V2-A, ImageNet-R, and ImageNet-A. Two metrics are considered, ECE ($\downarrow$) and NLL ($\downarrow$), and we also report both after calibration with temperature scaling. Each point represents a model. Colors represent model groups. For zero-shot CLIP, shapes additionally indicate the training distribution and quantity. We observe that CLIP models can be less calibrated than standard models. The training distribution and quantity are the key factors influencing the calibration of CLIP models. After temperature scaling, CLIP models follow a consistent trend that is distinct from that of other models.
  • Figure 5: Influence of test-time prompts on CLIP's robustness to visual factors, OOD detection, and predictive uncertainty. We include five CLIP models trained on WIT. Colors denote model architectures and shapes denote the prompt set used. The dashed grey line is fit with robust linear regression (Huber, 2011) to the original CLIP-WIT models using $80$ prompts. We see that prompt sets of size $1$, $5$, and $30$ decrease the classification performance of CLIP but may not change its robustness to visual factors.
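
The two OOD detection metrics named in the Figure 3 caption, AUROC and FPR@95, can be sketched in a few lines of standard-library Python. This is an illustration under assumptions rather than the paper's pipeline: each sample is assumed to already have a scalar "ID-ness" score (higher means more in-distribution), the scoring function itself is left unspecified, and the scores below are made up.

```python
import math

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample (ties count 0.5)."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID at the threshold that keeps `tpr` of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted; the small epsilon guards
    # against floating-point round-up in tpr * len(ranked).
    k = math.ceil(tpr * len(ranked) - 1e-9)
    threshold = ranked[k - 1]
    return sum(s >= threshold for s in ood_scores) / len(ood_scores)

# Hypothetical scores for 20 ID and 10 OOD samples (illustrative values only).
id_scores = [0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.91, 0.90,
             0.89, 0.88, 0.87, 0.86, 0.85, 0.84, 0.83, 0.82, 0.60, 0.30]
ood_scores = [0.10, 0.20, 0.25, 0.35, 0.40, 0.45, 0.50, 0.55, 0.65, 0.70]

print("AUROC :", auroc(id_scores, ood_scores))
print("FPR@95:", fpr_at_tpr(id_scores, ood_scores))
```

AUROC summarizes separation across all thresholds, whereas FPR@95 fixes one operating point: the threshold at which 95% of ID samples are retained. A model can therefore score well on one metric and poorly on the other, which is why the caption reports both.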