Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models

Tejas Srinivasan, Yonatan Bisk

TL;DR

This work analyzes how biases in vision and language interact in multimodal models, focusing on VL-BERT. It extends template-based bias probes from text-only LMs to visual-linguistic settings, partitioning bias into visual-linguistic pretraining, language-context, and visual-context sources. Through a controlled case study and a larger entity set, it shows a tendency toward masculine associations and demonstrates that language and visual cues can disproportionately influence predictions, sometimes overriding visual evidence. The findings highlight potential representational harms in multimodal systems and call for more inclusive data, probes, and ethical considerations in future research.

Abstract

Numerous works have analyzed biases in vision and pre-trained language models individually; however, less attention has been paid to how these biases interact in multimodal settings. This work extends text-based bias analysis methods to multimodal language models and analyzes the intra- and inter-modality associations and biases these models learn. Specifically, we demonstrate that VL-BERT (Su et al., 2020) exhibits gender biases, often preferring to reinforce a stereotype over faithfully describing the visual scene. We demonstrate these findings in a controlled case study and extend them to a larger set of stereotypically gendered entities.

Paper Structure

This paper contains 21 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Visual-linguistic models (like VL-BERT) encode gender biases, which (as is the case above) may lead the model to ignore the visual signal in favor of gendered stereotypes.
  • Figure 2: Pre-training association shift scores $S_{PT}(E, m/f)$. Positive shift scores indicate that VL-BERT associates the entity with the agent's gender more strongly than BERT does, and negative scores indicate the reverse.
  • Figure 3: Language association scores $S_L(E, m/f)$. Positive association scores indicate that the agent's gender increases the model's confidence in the entity.
  • Figure 4: Visual association scores $S_V(E, m/f)$. Positive association scores indicate that the model becomes more confident in the presence of a visual context.
  • Figure 5:
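
The captions above describe association scores whose sign indicates whether a cue (the agent's gender, or a visual context) raises or lowers the model's confidence in an entity. The exact formulas are not given in this summary, so the following is only a minimal sketch under an assumption: that a score is the log-ratio of the model's probability for the entity with the cue to its probability without the cue, a common choice in template-based probes. The function name `association_score` and all probability values are hypothetical and for illustration only.

```python
import math


def association_score(p_with_cue: float, p_without_cue: float) -> float:
    """Log-ratio of an entity's predicted probability with vs. without a cue.

    ASSUMPTION: this log-ratio form is illustrative; the summary above does
    not state the paper's exact definition of its scores.
    """
    if p_with_cue <= 0 or p_without_cue <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(p_with_cue / p_without_cue)


# Hypothetical masked-token probabilities for one entity (made-up numbers).
p_gendered_template = 0.30  # template names a gendered agent
p_neutral_template = 0.10   # template names a gender-neutral agent

score = association_score(p_gendered_template, p_neutral_template)
print(f"association score: {score:.3f}")
```

Under this assumed form, a positive score means the cue increases the model's confidence in the entity, a negative score means it decreases it, and zero means the cue makes no difference, which matches the sign convention stated in the captions.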