GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models

Ali Abdollahi, Mahdi Ghaznavi, Mohammad Reza Karimi Nejad, Arash Mari Oriyad, Reza Abbasi, Ali Salesi, Melika Behjati, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

TL;DR

The paper shows that in real-world applications, VLMs are biased toward identifying the individual of the expected gender (according to gender stereotypes ingrained in the model or other forms of sample selection bias) as the performer of the activity.

Abstract

Vision-language models (VLMs) are intensively used in many downstream tasks, including those requiring assessments of individuals appearing in the images. While VLMs perform well in simple single-person scenarios, in real-world applications, we often face complex situations in which there are persons of different genders doing different activities. We show that in such cases, VLMs are biased towards identifying the individual with the expected gender (according to ingrained gender stereotypes in the model or other forms of sample selection bias) as the performer of the activity. We refer to this bias in associating an activity with the gender of its actual performer in an image or text as the Gender-Activity Binding (GAB) bias and analyze how this bias is internalized in VLMs. To assess this bias, we have introduced the GAB dataset with approximately 5500 AI-generated images that represent a variety of activities, addressing the scarcity of real-world images for some scenarios. To have extensive quality control, the generated images are evaluated for their diversity, quality, and realism. We have tested 12 renowned pre-trained VLMs on this dataset in the context of text-to-image and image-to-text retrieval to measure the effect of this bias on their predictions. Additionally, we have carried out supplementary experiments to quantify the bias in VLMs' text encoders and to evaluate VLMs' capability to recognize activities. Our experiments indicate that VLMs experience an average performance decline of about 13.2% when confronted with gender-activity binding bias.

Paper Structure

This paper contains 38 sections, 4 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Overview of (a) the creation process of the introduced dataset and (b) the empirical tests conducted to assess the gender-activity binding bias in retrieval tasks within vision-language models. (a) Left: we gather three sets of activities that show bias: stereotypical activities, everyday activities, and activities that exhibit gender bias in the captions of LAION-400M (Schuhmann et al., 2021). Middle: we employ prompt enhancement techniques to develop a diverse, descriptive, and detailed prompt from a basic initial one, which helps us generate a wider range of images with higher quality and realism. Right: we utilize DALL-E 3 to construct our dataset based on the enhanced prompts. The generated images are selected to align with the activity and scenario and are evaluated for diversity, quality, and realism, with the aim of scoring highly on standard metrics. (b) Middle: joint embedding space of text and images in vision-language models. Left/Right: an overview of image-to-text/text-to-image retrieval tasks. The caption/image with the highest cosine similarity to the input image/caption is retrieved.
  • Figure 2: Average retrieval accuracy of VLMs on the image-to-text retrieval task across various scenarios. The chart highlights the performance drop between these scenarios for each model. The purple bar represents the accuracy in the scenario where the unexpected gender is performing the activity in the reference image, while the expected gender is also present in the scene. The red bar corresponds to the accuracy in the reverse scenario, where the expected gender is performing the activity. The blue bar denotes the scenario where the unexpected gender is performing the activity and is the only one present in the scene.
  • Figure 3: Average retrieval accuracy of VLMs on the image-to-text retrieval task for the E1 and U1 categories of images on stereotypical activities. The purple bar represents the accuracy in the scenario where the unexpected gender is performing the activity in the reference image and is alone in the scene. The blue bar denotes the scenario where the expected gender is performing the activity and is the only one present in the scene. In this task the template is “a <man/woman> is <doing activity>”.
  • Figure 4: Average accuracy of models in the text-to-image retrieval tasks. The images are sourced from the E2 and U2 groups, which are images that feature two individuals of different genders. The performance of models on the U2 and E2 groups is represented by the purple and blue bars, respectively. Additionally, the red and black dashed lines depict the average performance for the U2 and E2 groups, respectively.
  • Figure 5: Average retrieval accuracy of VLMs on the image-to-text retrieval task across various scenarios. The chart highlights the performance drop between these scenarios for each model. (This chart reports the performance of the models on gender-biased activities from LAION-400M; the experiment is similar to the one described in Figure 2 of the main text.)
  • ...and 3 more figures
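
The retrieval rule described in the Figure 1 caption (retrieve the caption or image with the highest cosine similarity to the query) can be illustrated with a minimal sketch. This is not the authors' code. The function names `cosine_similarity`, `retrieve`, and `retrieval_accuracy`, the three-dimensional toy embeddings, and the example captions are all assumptions made for illustration; a real VLM would produce much higher-dimensional embeddings from its image and text encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Return the index of the candidate with the highest cosine similarity."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def retrieval_accuracy(queries, candidate_sets, correct_indices):
    """Fraction of queries whose top-1 retrieved candidate is the correct one."""
    hits = sum(
        retrieve(q, cands) == gold
        for q, cands, gold in zip(queries, candidate_sets, correct_indices)
    )
    return hits / len(queries)


# Hypothetical toy embeddings (made up, not produced by any model).
image_embedding = [0.9, 0.1, 0.3]
caption_embeddings = [
    [0.8, 0.2, 0.4],  # hypothetical caption naming the actual performer
    [0.1, 0.9, 0.2],  # hypothetical caption naming the other person
]

# Image-to-text retrieval: pick the caption closest to the image embedding.
best_caption = retrieve(image_embedding, caption_embeddings)

# Top-1 accuracy over a (here, single-item) evaluation set.
accuracy = retrieval_accuracy([image_embedding], [caption_embeddings], [0])
```

Text-to-image retrieval follows the same pattern with the roles swapped: the caption embedding is the query and the image embeddings are the candidates. Computing `retrieval_accuracy` separately per scenario group would give per-group numbers of the kind the figure captions compare.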