Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models
Tejas Srinivasan, Yonatan Bisk
TL;DR
This work analyzes how biases in vision and language interact in multimodal models, focusing on VL-BERT. It extends template-based bias probes from text-only language models to visual-linguistic settings and attributes bias to three sources: visual-linguistic pre-training, language context, and visual context. Through a controlled case study and a larger entity set, it shows a tendency toward masculine associations and demonstrates that cues from the language and visual context can disproportionately influence predictions, with stereotypes sometimes overriding the visual evidence. The findings highlight potential representational harms in multimodal systems and call for more inclusive data, probes, and ethical considerations in future research.
Abstract
Numerous works have analyzed biases in vision models and pre-trained language models individually; however, less attention has been paid to how these biases interact in multimodal settings. This work extends text-based bias analysis methods to multimodal language models and analyzes the intra- and inter-modality associations and biases these models learn. Specifically, we demonstrate that VL-BERT (Su et al., 2020) exhibits gender biases, often preferring to reinforce a stereotype over faithfully describing the visual scene. We demonstrate these findings in a controlled case study and extend them to a larger set of stereotypically gendered entities.
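
To make the idea of a template-based bias probe concrete, the sketch below shows one common form such a probe can take: compare the probability a masked language model assigns to a gendered word when an entity is present in the template against the probability when the entity is masked out. This is a minimal illustration under stated assumptions, not necessarily the exact scoring used in this work. The function names `association_score` and `gender_bias` are hypothetical, and the probabilities in the usage lines are made-up toy values, not results.

```python
import math


def association_score(p_with_entity: float, p_prior: float) -> float:
    """Log-ratio of a gendered word's probability with the entity present
    versus with the entity masked (the prior). Positive values mean the
    entity raises the word's probability. Hypothetical helper."""
    if p_with_entity <= 0.0 or p_prior <= 0.0:
        raise ValueError("probabilities must be strictly positive")
    return math.log(p_with_entity / p_prior)


def gender_bias(p_male: float, p_male_prior: float,
                p_female: float, p_female_prior: float) -> float:
    """Difference between the masculine and feminine association scores.
    Positive values indicate a lean toward the masculine word.
    Hypothetical helper."""
    return (association_score(p_male, p_male_prior)
            - association_score(p_female, p_female_prior))


# Toy values (assumed for illustration only): the entity raises the
# masculine word's probability and lowers the feminine word's.
toy_bias = gender_bias(p_male=0.6, p_male_prior=0.5,
                       p_female=0.2, p_female_prior=0.5)
print(f"toy bias score: {toy_bias:.3f}")
```

In a visual-linguistic setting, the same comparison could in principle be repeated while varying what is shown in the image or stated in the text, which is one way to separate language-context from visual-context effects.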
