Identifying Implicit Social Biases in Vision-Language Models

Kimia Hamidieh, Haoran Zhang, Walter Gerych, Thomas Hartvigsen, Marzyeh Ghassemi

TL;DR

A systematic analysis of the social biases present in CLIP, focusing on the interaction between image and text modalities. The paper also proposes So-B-IT, a taxonomy of social biases containing 374 words categorized across ten types of bias.

Abstract

Vision-language models, like CLIP (Contrastive Language Image Pretraining), are becoming increasingly popular for a wide range of multimodal retrieval tasks. However, prior work has shown that large language and deep vision models can learn historical biases contained in their training sets, leading to perpetuation of stereotypes and potential downstream harm. In this work, we conduct a systematic analysis of the social biases that are present in CLIP, with a focus on the interaction between image and text modalities. We first propose a taxonomy of social biases called So-B-IT, which contains 374 words categorized across ten types of bias. Each type can lead to societal harm if associated with a particular demographic group. Using this taxonomy, we examine images retrieved by CLIP from a facial image dataset using each word as part of a prompt. We find that CLIP frequently displays undesirable associations between harmful words and specific demographic groups, such as retrieving mostly pictures of Middle Eastern men when asked to retrieve images of a "terrorist". Finally, we analyze the source of such biases by showing that, for the examples of bias we find, the same harmful stereotypes are also present in a large image-text dataset used to train CLIP models. Our findings highlight the importance of evaluating and addressing bias in vision-language models, and suggest the need for transparency and fairness-aware curation of large pre-training datasets.

Paper Structure

This paper contains 35 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Identifying biases in CLIP using word associations.
  • Figure 2: Flowchart demonstrating the process for image retrieval in FairFace. For each word of interest in each category, we compute its embedding with the CLIP text encoder, and retrieve the top 100 closest images by cosine similarity. We then examine the demographic distribution of retrieved images, and compute the $\texttt{C-ASC}$ score.
  • Figure 3: Normalized entropy of the top-k distribution over gender for each category in So-B-IT. Higher values indicate less gender bias. The gender bias of VL models is most stark for the occupation category. As expected, DebiasCLIP exhibits the least gender bias.
  • Figure 4: Normalized entropy of the top-k distribution over race for each category in So-B-IT. Higher values indicate less racial bias. The racial bias of VL models is most prominently seen in the religion, political, and education categories.
  • Figure 5: Intersectional bias in OAICLIP for a set of words most strongly associated with the "Male" gender.
  • ...and 3 more figures
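
The retrieval step described in the Figure 2 caption and the normalized entropy reported in Figures 3 and 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are made-up 2-d vectors standing in for CLIP text and image encoder outputs, the function names are hypothetical, and the entropy is assumed to be normalized by the log of the number of demographic groups (the captions do not state the exact normalization). The C-ASC score is not defined in this excerpt, so it is not reproduced here.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(text_embedding, image_embeddings, k=100):
    """Return indices of the k images closest to the text embedding."""
    scores = [cosine_similarity(text_embedding, img) for img in image_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def normalized_entropy(labels, groups):
    """Shannon entropy of the label distribution divided by log(len(groups)).

    Assumed normalization: 1.0 means the retrieved images are spread evenly
    over the groups, 0.0 means they all belong to a single group.
    """
    counts = Counter(labels)
    total = len(labels)
    entropy = 0.0
    for group in groups:
        p = counts.get(group, 0) / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy / math.log(len(groups))


# Toy, made-up data: 2-d stand-ins for embeddings and a label per image.
image_embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
    [0.0, 1.0], [0.1, 0.9], [0.5, 0.5],
]
image_labels = ["Male", "Male", "Male", "Female", "Female", "Female"]
text_embedding = [1.0, 0.0]  # stand-in for the embedding of a prompt word

top = retrieve_top_k(text_embedding, image_embeddings, k=3)
retrieved = [image_labels[i] for i in top]
score = normalized_entropy(retrieved, groups=["Male", "Female"])
```

In this toy setup the three retrieved images share one label, so the score is at the low end of the scale, which matches the captions' reading that higher values indicate less bias. In the setting the captions describe, k would be 100 and the labels would come from the demographic annotations of the facial image dataset.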