Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment

Aravind Narayanan, Vahid Reza Khazaie, Shaina Raza

TL;DR

This paper investigates how visible social cues in images influence bias in vision-language models by introducing a real-world news-image benchmark of 1,343 image–question pairs annotated with demographic attributes. It employs a two-stage methodology: careful data collection/annotation and standardized prompt-based multimodal evaluation, using a GPT-4o-based judge to assess bias, relevance, and faithfulness. Key findings show that visual context shifts model outputs, bias varies across attributes and models (notably gender and occupation), and higher faithfulness does not guarantee lower bias, highlighting a crucial bias–faithfulness trade-off. The work contributes a publicly released benchmark, prompts, evaluation rubric, and code to enable fairness-aware, reproducible multimodal evaluation of VLMs.

Abstract

Large vision-language models (VLMs) can jointly interpret images and text, but they are also prone to absorbing and reproducing harmful social stereotypes when visual cues such as age, gender, race, clothing, or occupation are present. To investigate these risks, we introduce a news-image benchmark consisting of 1,343 image-question pairs drawn from diverse outlets, which we annotated with ground-truth answers and demographic attributes (age, gender, race, occupation, and sports). We evaluate a range of state-of-the-art VLMs and employ a large language model (LLM) as judge, with human verification. Our findings show that: (i) visual context systematically shifts model outputs in open-ended settings; (ii) bias prevalence varies across attributes and models, with particularly high risk for gender and occupation; and (iii) higher faithfulness does not necessarily correspond to lower bias. We release the benchmark prompts, evaluation rubric, and code to support reproducible and fairness-aware multimodal assessment.
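
As an illustration of how findings (ii) and (iii) could be read off judge outputs, the following is a minimal sketch, not the authors' released code. It assumes each judged response is stored as a record with a hypothetical `attribute` label, a boolean `biased` flag, and a `faithfulness` score in [0, 1]. The field names, score scales, and sample values are all invented for illustration. The sketch groups records by attribute and reports the share flagged as biased alongside the mean faithfulness.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judge outputs: one record per judged model response.
# Field names, score scales, and values are assumptions for illustration only.
records = [
    {"attribute": "gender", "biased": True, "faithfulness": 0.9},
    {"attribute": "gender", "biased": False, "faithfulness": 0.8},
    {"attribute": "occupation", "biased": True, "faithfulness": 0.7},
    {"attribute": "age", "biased": False, "faithfulness": 0.6},
]


def summarize(records):
    """Return {attribute: (bias_rate, mean_faithfulness)}."""
    groups = defaultdict(list)
    for r in records:
        groups[r["attribute"]].append(r)
    return {
        attr: (
            # Share of responses the judge flagged as biased.
            mean(1.0 if r["biased"] else 0.0 for r in rows),
            # Average faithfulness score over the same responses.
            mean(r["faithfulness"] for r in rows),
        )
        for attr, rows in groups.items()
    }


summary = summarize(records)
for attr, (bias_rate, faith) in sorted(summary.items()):
    print(f"{attr}: bias rate {bias_rate:.2f}, mean faithfulness {faith:.2f}")
```

Reporting the two numbers side by side per attribute is what makes a pattern like finding (iii) visible: an attribute can score high on faithfulness while still showing a high bias rate.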

Paper Structure

This paper contains 9 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Dataset Construction and Evaluation Pipeline. The figure illustrates our two-stage process: (top) data sourcing, filtration, and annotation across four demographic categories (age, gender, race, profession); and (bottom) task setting and multimodal evaluation for grounding, robustness, and reasoning, with outputs scored for accuracy, bias, and faithfulness.
  • Figure 2: VLM Benchmark Summary. (A) Overall accuracy across models. (B) Attribute-level breakdown. (C) Bias vs. faithfulness trade-off.