Can Less Precise Be More Reliable? A Systematic Evaluation of Quantization's Impact on CLIP Beyond Accuracy

Aymen Bouguerra, Daniel Montoya, Alexandra Gomez-Villa, Fabio Arnez, Chokri Mraidha

TL;DR

This work demonstrates that quantization consistently improves calibration for typically underconfident pre-trained models, while often degrading it for overconfident variants, and identifies specific quantization-aware training methods that yield simultaneous gains in zero-shot accuracy, calibration, and OOD robustness.

Abstract

The powerful zero-shot generalization capabilities of vision-language models (VLMs) like CLIP have enabled new paradigms for safety-related tasks such as out-of-distribution (OOD) detection. However, additional aspects crucial for the computationally efficient and reliable deployment of CLIP are still overlooked. In particular, the impact of quantization on CLIP's performance beyond accuracy remains underexplored. This work presents a large-scale evaluation of quantization on CLIP models, assessing not only in-distribution accuracy but a comprehensive suite of reliability metrics and revealing counterintuitive results driven by pre-training source. We demonstrate that quantization consistently improves calibration for typically underconfident pre-trained models, while often degrading it for overconfident variants. Intriguingly, this degradation in calibration does not preclude gains in other reliability metrics; we find that OOD detection can still improve for these same poorly calibrated models. Furthermore, we identify specific quantization-aware training (QAT) methods that yield simultaneous gains in zero-shot accuracy, calibration, and OOD robustness, challenging the view of a strict efficiency-performance trade-off. These findings offer critical insights for navigating the multi-objective problem of deploying efficient, reliable, and robust VLMs by utilizing quantization beyond its conventional role.
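To make the calibration terminology concrete, here is a minimal sketch of the expected calibration error (ECE) and of a relative ECE change with respect to a full-precision baseline. The binning scheme (15 equal-width bins), the function names, and the percentage formula are assumptions chosen for illustration; the abstract does not state the exact protocol used in the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Standard binned ECE: sample-weighted mean of |accuracy - confidence| per bin.

    confidences: max softmax probability per sample, in (0, 1].
    correct:     1 if the prediction was right, else 0.
    n_bins:      number of equal-width bins (15 is an assumed default).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi]; a confidence of exactly 0 falls in no bin.
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

def relative_ece_change(ece_quantized, ece_fp32):
    """Hypothetical helper: percent change in ECE relative to the FP32 baseline.

    Negative values mean the quantized model is better calibrated.
    """
    return 100.0 * (ece_quantized - ece_fp32) / ece_fp32
```

Under this definition, an underconfident model reports confidences below its actual accuracy and an overconfident model reports confidences above it; both cases raise ECE, which is why the sign of the change has to be read together with the direction of the miscalibration.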

Paper Structure

This paper contains 53 sections, 1 equation, 22 figures, 17 tables.

Figures (22)

  • Figure 1: The dichotomous impact of quantization on zero-shot performance. WIT models (blue) consistently improve in calibration (left), with several QAT methods achieving simultaneous accuracy gains. In contrast, LAION models (orange) show systematic degradation in calibration (right). The lack of points near the origin suggests that quantization always has a measurable impact.
  • Figure 2: Average in-distribution accuracy change for WIT (blue) and LAION (orange) sources for both ViT-B/32 and ViT-L/14 backbones under various quantization methods, relative to the FP32 baseline (0%).
  • Figure 3: Robustness to Decreasing Quantization Precision. Average ID accuracy vs. bit-width. While simpler QAT methods collapse at 4-bit precision, advanced methods are more robust.
  • Figure 4: Accuracy of quantized models as a function of iteration steps; higher-precision models exhibit catastrophic forgetting, while lower-precision models struggle to recover.
  • Figure 5: Impact of QAT Methods on CLIP Model Calibration on ViT-B/32. Violin plots show Relative ECE Change (%), comparing LAION (left, blue) and WIT (right, orange) pre-training. Negative values signify calibration improvement. Black crosses represent runs with less than 2% accuracy degradation. Please refer to our appendix for ViT-B/16 and ViT-L/14 results.
  • ...and 17 more figures
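
As a rough illustration of why accuracy can become fragile as bit-width decreases (the theme of Figure 3), the sketch below applies generic symmetric, per-tensor uniform quantization to a random weight vector and compares the reconstruction error at 8 and 4 bits. This is not one of the QAT methods evaluated in the paper; the function name `fake_quantize`, the per-tensor max-abs scaling, and the synthetic Gaussian weights are all assumptions made for illustration.

```python
import numpy as np

def fake_quantize(x, n_bits):
    """Quantize to a signed n_bits integer grid, then map back to floats.

    Uses one scale for the whole tensor, derived from the largest magnitude
    (an assumed, deliberately simple scheme).
    """
    x = np.asarray(x, dtype=np.float64)
    qmax = 2 ** (n_bits - 1) - 1          # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x.copy()                   # all-zero tensor: nothing to quantize
    scale = max_abs / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)  # integer levels
    return q * scale                      # dequantized ("fake-quantized") values

# Synthetic stand-in for a weight tensor; real CLIP weights are not used here.
rng = np.random.default_rng(0)
weights = rng.normal(size=1000)

# Mean squared reconstruction error per bit-width.
errors = {
    bits: float(np.mean((weights - fake_quantize(weights, bits)) ** 2))
    for bits in (8, 4)
}
```

With only 15 representable levels at 4 bits versus 255 at 8 bits, the rounding error per weight is much larger at 4 bits, which gives some intuition for why simpler schemes may collapse at low precision while methods that adapt the quantization grid can fare better.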