You Never Know: Quantization Induces Inconsistent Biases in Vision-Language Foundation Models

Eric Slyman, Anirudh Kanneganti, Sanghyun Hong, Stefan Lee

TL;DR

An extensive evaluation of four quantization settings across three datasets and three CLIP variants yields a surprising result: while individual models demonstrate bias, there is no consistent change in bias magnitude or direction across a population of compressed models due to quantization.

Abstract

We study the impact of quantization, a standard practice in compressing vision-language foundation models, on the models' ability to produce socially fair outputs. In contrast to prior findings with unimodal models that compression consistently amplifies social biases, our extensive evaluation of four quantization settings across three datasets and three CLIP variants yields a surprising result: while individual models demonstrate bias, we find no consistent change in bias magnitude or direction across a population of compressed models due to quantization.
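For readers unfamiliar with the compression step under study, the following is a minimal sketch of what quantization does to a weight tensor. It assumes symmetric per-tensor int8 quantization, which is a generic textbook scheme and not necessarily any of the four settings evaluated in the paper; the function names and the toy weights are hypothetical.

```python
# Generic illustration only: symmetric per-tensor int8 quantization.
# This is NOT claimed to be one of the paper's quantization settings.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Toy weights (hypothetical values chosen for illustration).
weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
errors = [abs(a - b) for a, b in zip(weights, restored)]
print(q, scale, errors)
```

Small weights such as `0.003` collapse to zero under this scheme, which is the kind of perturbation whose downstream effect on model outputs the paper evaluates.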

Paper Structure

This paper contains 5 sections, 1 equation, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: Zero-shot image classification accuracy on ImageNet1K (Deng et al., 2009) and text-image retrieval recall on COCO Captions (Lin et al., 2014) across varied CLIP versions, training data sources, and quantization methods. Higher ($\uparrow$) is better in all cases. HuggingFace-based quantization methods preserve performance while the PyTorch-based method shows a reduction across metrics.