A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models

Ashutosh Sathe; Prachi Jain; Sunayana Sitaram

TL;DR

The paper addresses societal bias in vision-language models by introducing a unified evaluation framework that covers all four inference directions (image-to-text, text-to-text, text-to-image, image-to-image) and three bias attributes: gender, race, and age. It builds a bias-bleached, action-based dataset of 1016 profession-centered prompts and uses neutral images of humanoid robots to isolate task context from appearance, paired with a novel Neutrality metric that quantifies bias across class pairs. Systematic probing along two axes (Direct vs. Indirect and Blind vs. Informed) reveals bias patterns specific to each modality and attribute: proprietary models generally exhibit lower bias, while multi-modal models such as CoDi can show pronounced biases in all directions. The work provides a practical resource (dataset and code) and a methodological foundation for bias mitigation in VLMs, with implications for safer, fairer deployment in real-world professional contexts.

Abstract

Vision-language models (VLMs) have gained widespread adoption in both industry and academia. In this study, we propose a unified framework for systematically evaluating gender, race, and age biases in VLMs with respect to professions. Our evaluation encompasses all inference modes supported by recent VLMs: image-to-text, text-to-text, text-to-image, and image-to-image. Additionally, we propose an automated pipeline to generate high-quality synthetic datasets that intentionally conceal gender, race, and age information across different professional domains, in both generated text and images. The dataset includes action-based descriptions of each profession and serves as a benchmark for evaluating societal biases in VLMs. In our comparative analysis of widely used VLMs, we find that varying the input-output modalities leads to discernible differences in bias magnitudes and directions. We also find that VLMs exhibit distinct biases across the bias attributes we investigated. We hope our work will help guide future progress in improving VLMs to learn socially unbiased representations. We will release our data and code.

Paper Structure (27 sections, 2 equations, 15 figures, 23 tables)

Introduction
Related Work
Action-based dataset
VLM Evaluation Framework
Dataset construction
Quantifying bias
Model probing techniques
Direct vs Indirect
Blind vs Informed
Experiments
Image-to-Text
Text-to-Text
Text-to-Image
Image-to-Image
Overall VLM Bias
...and 12 more sections

Figures (15)

Figure 1: Samples of generated humanoid images.
Figure 2: All the models we evaluate across the various inference directions. The Y-axis is the input dimension and the X-axis is the output dimension.
Figure 3: Generating professional actions using GPT-4.
Figure 4: A filtering process is applied to low-quality prompts obtained from Figure 3. If a prompt fails to enable a generative model to re-generate the original profession mentioned in the parent prompt (Figure 3), it is filtered out.
Figure 5: Prompt used for 'Blind Direct' probing in the image-to-text direction.
...and 10 more figures
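
The filtering step described in the Figure 4 caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the exact-match comparison, and the keyword-based `toy_guesser` (a stand-in for a call to a generative model) are all assumptions made for illustration only.

```python
from typing import Callable, Iterable


def filter_prompts(
    prompts: Iterable[tuple[str, str]],
    guess_profession: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Keep (profession, action_prompt) pairs whose profession the model recovers.

    `guess_profession` is a hypothetical callable wrapping a generative model:
    it receives only the action description and returns a profession name.
    """
    kept = []
    for profession, action in prompts:
        guess = guess_profession(action)
        # Assumption: a case-insensitive exact match counts as "re-generated".
        if guess.strip().lower() == profession.strip().lower():
            kept.append((profession, action))
    return kept


def toy_guesser(action: str) -> str:
    """Hypothetical stand-in for a model call, based on keyword lookup."""
    keywords = {"stethoscope": "doctor", "wrench": "plumber"}
    for word, profession in keywords.items():
        if word in action.lower():
            return profession
    return "unknown"


# Made-up candidate prompts, not taken from the paper's dataset.
candidates = [
    ("doctor", "Listening to a patient's chest with a stethoscope"),
    ("plumber", "Tightening a pipe fitting with a wrench"),
    ("teacher", "Standing in a room"),  # too vague, so it is filtered out
]
kept = filter_prompts(candidates, toy_guesser)
```

In this toy run the vague "teacher" prompt is dropped because the profession cannot be recovered from the action alone, which mirrors the round-trip check the caption describes.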
