From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment

Yusuke Hirota, Ryo Hachiuma, Chao-Han Huck Yang, Yuta Nakashima

TL;DR

This study compares standard-format captions and recent GCE processes from the perspectives of gender bias and hallucination, showing that enriched captions suffer from increased gender bias and hallucination.

Abstract

Large language models (LLMs) have enhanced the capacity of vision-language models to caption visual text. This generative approach to image caption enrichment further makes textual captions more descriptive, improving alignment with the visual context. However, while many studies focus on benefits of generative caption enrichment (GCE), are there any negative side effects? We compare standard-format captions and recent GCE processes from the perspectives of "gender bias" and "hallucination", showing that enriched captions suffer from increased gender bias and hallucination. Furthermore, models trained on these enriched captions amplify gender bias by an average of 30.9% and increase hallucination by 59.5%. This study serves as a caution against the trend of making captions more descriptive.

Paper Structure

This paper contains 17 sections, 4 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Left: an overview of our analysis. Although the "LLM-enriched" caption (ShareGPT4V) covers more content than the standard COCO caption (objects described in captions are bolded), it exhibits hallucination (in yellow) and gender bias, including a description of gender that does not exist in the image and a possibly gender-stereotypical sentence (in purple). Right: a comparison between standard and enriched captions on caption quality, bias, and hallucination.
  • Figure 2: LIC vs. Recall (left: upstream, right: downstream). The bubble size indicates vocabulary size. LIC tends to increase with higher recall, shown by strong trends (dotted lines) with $R^2 = 0.99$ (left) and $R^2 = 0.97$ (right).
  • Figure 3: $\text{CHAIR}_\text{s}$ vs. Recall (left: upstream, right: downstream). The bubble size indicates vocabulary size. $\text{CHAIR}_\text{s}$ tends to increase with higher recall, shown by strong trends with $R^2 = 0.80$ (left) and $R^2 = 0.76$ (right).
  • Figure 4: Recall disparity by visual object.
  • Figure 5: Qualitative examples of the comparison between COCO captions and ShareGPT4V. Objects described in captions are bolded. Gender bias and hallucination are highlighted in purple and yellow, respectively.
  • ...and 2 more figures
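
The trend lines in Figures 2 and 3 are summarized by $R^2$ values. As a rough illustration of what such a value measures, the sketch below fits a straight line to (recall, metric) pairs and reports the coefficient of determination. The paper does not state how its trend lines were fitted, so ordinary least squares is an assumption here, and the function name `linear_fit_r2` and the toy numbers are hypothetical, not taken from the paper.

```python
def linear_fit_r2(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns (slope, intercept, r2), where r2 is the coefficient of
    determination of the fitted line. Assumption: the trend is linear.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of squared deviations in x, and cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("all y values are identical; R^2 is undefined")

    return slope, intercept, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical toy values, NOT results from the paper:
    # one (recall, bias-metric) pair per captioning approach.
    recall = [40.0, 55.0, 62.0, 71.0]
    metric = [1.2, 3.9, 5.1, 7.0]
    slope, intercept, r2 = linear_fit_r2(recall, metric)
    print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.3f}")
```

An $R^2$ close to 1 means that a single straight line accounts for almost all of the variation in the metric across the compared caption sets, which is how the "strong trends" in the figure captions can be read.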