Think Before You Act: A Two-Stage Framework for Mitigating Gender Bias Towards Vision-Language Tasks

Yunqi Zhang, Songda Li, Chunyuan Deng, Luyi Wang, Hui Zhao

TL;DR

This paper proposes GAMA, a task-agnostic generation framework to mitigate gender bias in vision-language models, and identifies object hallucination as the essence of gender bias in VLMs.

Abstract

Gender bias in vision-language models (VLMs) can reinforce harmful stereotypes and discrimination. In this paper, we focus on mitigating gender bias towards vision-language tasks. We identify object hallucination as the essence of gender bias in VLMs. Existing VLMs tend to focus on salient or familiar attributes in images but ignore contextualized nuances. Moreover, most VLMs rely on the co-occurrence between specific objects and gender attributes to infer the ignored features, ultimately resulting in gender bias. We propose GAMA, a task-agnostic generation framework to mitigate gender bias. GAMA consists of two stages: narrative generation and answer inference. During narrative generation, GAMA yields all-sided but gender-obfuscated narratives, which prevents premature concentration on localized image features, especially gender attributes. During answer inference, GAMA integrates the image, generated narrative, and a task-specific question prompt to infer answers for different vision-language tasks. This approach allows the model to rethink gender attributes and answers. We conduct extensive experiments on GAMA, demonstrating its debiasing and generalization ability.
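
The two-stage flow described in the abstract can be sketched as below. This is a minimal illustration, not the authors' implementation: the class `TwoStagePipeline`, the callable signatures, and the stub models are all hypothetical names introduced here, and the real GAMA stages are learned models rather than plain functions.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signatures (assumptions, not from the paper):
# stage 1 maps an image to a gender-obfuscated narrative,
# stage 2 maps (image, narrative, question prompt) to an answer.
NarrativeModel = Callable[[bytes], str]
AnswerModel = Callable[[bytes, str, str], str]


@dataclass
class TwoStagePipeline:
    generate_narrative: NarrativeModel
    infer_answer: AnswerModel

    def run(self, image: bytes, question_prompt: str) -> str:
        # Stage 1: describe the whole scene without naming gender.
        narrative = self.generate_narrative(image)
        # Stage 2: answer the task-specific question using the image
        # together with the narrative produced in stage 1.
        return self.infer_answer(image, narrative, question_prompt)


# Stub models standing in for the trained components.
def stub_narrative(image: bytes) -> str:
    return "A person stands on a grass field next to a ball."


def stub_answer(image: bytes, narrative: str, question_prompt: str) -> str:
    return f"{question_prompt} -> based on: {narrative}"


pipeline = TwoStagePipeline(stub_narrative, stub_answer)
result = pipeline.run(b"", "Describe the image.")
print(result)
```

Because the task enters only through `question_prompt`, the same pipeline object can serve different vision-language tasks, which is the sense in which the abstract calls the framework task-agnostic.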

Paper Structure (68 sections, 23 equations, 4 figures, 13 tables)

Figures (4)

  • Figure 1: Examples in image captioning with annotations and captions generated by a baseline model, SAT (Xu et al., 2015). We overlay images with attention heatmaps of SAT on the right. In the top example, SAT focuses on the woman and predicts "juice", a word that frequently co-occurs with females. In the bottom example, the gender is incorrectly predicted, as "soccer" frequently co-occurs with males in the training set.
  • Figure 2: The overall framework of GAMA. We briefly provide task-specific question prompts and answers, which are detailed in the appendix on implementation details. We take the token probability of the decoder as the match score in image search.
  • Figure 3: Comparison of task performance and gender bias mitigation ability. We normalize each metric separately, then sum the normalized gender bias metrics and the normalized image captioning metrics into two aggregate scores.
  • Figure 4: Heatmap visualization of the co-occurrence frequency between gender attributes and certain words. We select five words each that frequently co-occur with females and with males in the training set, and show how often gender attributes co-occur with these words in the model predictions. Darker colors indicate higher frequencies.
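
The co-occurrence frequencies visualized in Figure 4 could be computed along the following lines. This is a sketch under stated assumptions: the gender word lists, the whitespace tokenization, the normalization by the number of gendered captions, and the toy captions are all illustrative choices made here, not the paper's actual procedure or data.

```python
from collections import Counter

# Illustrative gender lexicons (assumption; the paper's lists may differ).
FEMALE_WORDS = {"woman", "women", "girl", "she", "her"}
MALE_WORDS = {"man", "men", "boy", "he", "his"}


def cooccurrence_frequency(captions, target_words):
    """Fraction of captions mentioning a gender that also contain each target word."""
    counts = Counter()   # (gender, word) -> number of captions with both
    totals = Counter()   # gender -> number of captions mentioning that gender
    for caption in captions:
        tokens = set(caption.lower().split())
        for gender, lexicon in (("female", FEMALE_WORDS), ("male", MALE_WORDS)):
            if tokens & lexicon:
                totals[gender] += 1
                for word in target_words:
                    if word in tokens:
                        counts[(gender, word)] += 1
    return {
        (gender, word): counts[(gender, word)] / totals[gender]
        for gender in ("female", "male")
        for word in target_words
        if totals[gender] > 0
    }


# Toy predictions, made up for illustration only.
captions = [
    "a woman holding a glass of juice",
    "a man playing soccer on a field",
    "a woman playing soccer",
    "a man drinking juice",
    "a man kicking a soccer ball",
]
freq = cooccurrence_frequency(captions, ["juice", "soccer"])
for key, value in sorted(freq.items()):
    print(key, round(value, 3))
```

A large gap between the female and male rows for the same word would appear as contrasting cell shades in a heatmap of this kind.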