CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?

Ibrahim Alabdulmohsin, Xiao Wang, Andreas Steiner, Priya Goyal, Alexander D'Amour, Xiaohua Zhai

TL;DR

Fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval.

Abstract

We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP), identifying areas of strength and limitation. First, we reaffirm prior conclusions that CLIP models can inadvertently absorb societal stereotypes. To counter this, we present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases (i.e. in first- and second-order statistics) in multimodal data. We use M4 to conduct an in-depth analysis taking into account various factors, such as the model, representation, and data size. Our study also explores the dynamic nature of how CLIP learns and unlearns biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Also, data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval. Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with recommendations for improving the efficacy of data balancing in multimodal systems.

Paper Structure

This paper contains 40 sections, 3 theorems, 20 equations, 20 figures, 7 tables, 2 algorithms.

Key Result

Proposition 1

Algorithm [alg:code] terminates with an optimal solution to the optimization problem in Eq. [eq:deb_loss].

Figures (20)

  • Figure 1: Top: Text-to-image models prompted for occupations, such as manager / secretary (left) or pilot / flight attendant (right), can reflect societal stereotypes. Refer to the introduction for the exact prompts. Bottom: CLIP can encode societal stereotypes, such as by associating cars with men. See the results section.
  • Figure 2: Top: Mean parity $\mathbb{E}[p(\mathrm{man}) - p(\mathrm{woman})]$ across images from the ILSVRC2012 dataset (Deng et al., 2009). Values closer to zero are better. Bottom: On the left, parity scores for ViT-B/16 (longer visual sequence length). On the right, $p$-values calculated using Wilcoxon's signed-rank test (Wilcoxon, 1992) for the null hypothesis that the column has the same effect as the row.
  • Figure 3: CLIP is trained on 1B examples split into two stages. On the left, it is initially trained on intervened data with proxies, before switching to the original data. On the right, it is trained on the original data before intervening. Legends indicate the fraction of time [%] assigned to Stage 1.
  • Figure 4: Top: A comparison of association bias (AB; perceived gender against occupation) evaluated on three downstream datasets. Bottom: ViT-B/16 results (left) and statistical analysis (right), as in Figure 2.
  • Figure 5: A summary of how CLIP learns or unlearns association bias (shown on the $y$-axis) when intervened data comprises different percentages [%] of the training duration. The setup is similar to Figure 3.
  • ...and 15 more figures
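
The mean parity in the Figure 2 caption, $\mathbb{E}[p(\mathrm{man}) - p(\mathrm{woman})]$, can be illustrated with a minimal sketch. It assumes the probabilities come from a softmax over image-text similarities for two prompts, computed from precomputed, L2-normalised embeddings. The function names, the temperature of 100, and the prompt order are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def mean_parity(image_emb, text_emb, temperature=100.0):
    """Average of p(man) - p(woman) over a set of images.

    image_emb: (n, d) array of L2-normalised image embeddings.
    text_emb:  (2, d) array of L2-normalised text embeddings; row 0 is
               assumed to be the "man" prompt and row 1 the "woman" prompt.
    temperature: assumed logit scale applied before the softmax.
    """
    logits = temperature * image_emb @ text_emb.T  # (n, 2) similarities
    probs = softmax(logits, axis=1)
    return float(np.mean(probs[:, 0] - probs[:, 1]))


# Toy example with hand-made 2-D embeddings (not real CLIP outputs).
text = np.array([[1.0, 0.0], [0.0, 1.0]])
balanced_images = np.array([[1.0, 0.0], [0.0, 1.0]])
skewed_images = np.array([[1.0, 0.0], [1.0, 0.0]])

print(mean_parity(balanced_images, text))
print(mean_parity(skewed_images, text))
```

In this toy setup the balanced set yields a parity near zero, while the skewed set yields a value near one. This matches the caption's reading that values closer to zero are better.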

Theorems & Definitions (7)

  • Definition 1: Data Representation Bias
  • Definition 2: Data Association Bias
  • Definition 3: Model Representation Bias
  • Definition 4: Model Association Bias
  • Proposition 1
  • Proposition 2
  • Lemma 1
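
This listing names Definitions 1 and 2 but does not reproduce them. The abstract characterises representation and association biases as first- and second-order statistics, and the sketch below shows one plausible formalisation of that idea for binary attributes. It is an assumption made for illustration, not the paper's exact definitions. Representation bias is taken as the largest gap between an attribute's frequency and a target frequency. Association bias is taken as the largest gap in a label's mean between examples with and without the attribute.

```python
import numpy as np


def representation_bias(s, target):
    """Largest gap between target and observed attribute frequencies.

    s:      (n, m) binary array, one column per sensitive attribute.
    target: (m,) array of desired frequencies (hypothetical parameter).
    """
    s = np.asarray(s, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.max(np.abs(target - s.mean(axis=0))))


def association_bias(s, y):
    """Largest gap |E[y_r | s_k = 1] - E[y_r | s_k = 0]| over all k, r.

    s: (n, m) binary array of sensitive attributes.
    y: (n, c) binary array of other labels (for example, occupations).
    """
    s = np.asarray(s, dtype=bool)
    y = np.asarray(y, dtype=float)
    gaps = []
    for k in range(s.shape[1]):
        with_attr, without_attr = y[s[:, k]], y[~s[:, k]]
        if len(with_attr) == 0 or len(without_attr) == 0:
            continue  # conditional mean is undefined for an empty group
        gaps.append(np.abs(with_attr.mean(axis=0) - without_attr.mean(axis=0)).max())
    return float(max(gaps)) if gaps else 0.0


# Toy data: one attribute and one label over four examples.
s_toy = np.array([[1], [1], [1], [0]])
y_toy = np.array([[1], [1], [0], [0]])

print(representation_bias(s_toy, target=[0.5]))
print(association_bias(s_toy, y_toy))
```

Under these assumptions, a balancing procedure would aim to drive both quantities toward zero in the training data. The first depends only on attribute frequencies, while the second depends on how attributes co-occur with labels.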