ViSAGe: A Global-Scale Analysis of Visual Stereotypes in Text-to-Image Generation

Akshita Jha; Vinodkumar Prabhakaran; Remi Denton; Sarah Laszlo; Shachi Dave; Rida Qadri; Chandan K. Reddy; Sunipa Dev

TL;DR

ViSAGe presents a scalable, globally informed framework to evaluate visual stereotypes in Text-to-Image generation by grounding visual attributes in SeeGULL, a large textual stereotype resource. The approach first isolates visually depictable stereotypes, then assesses their presence in generated images via both human annotations and automated methods (CLIP+BART with salience scoring). The study reveals that generated images are, on average, three times more likely to depict stereotypical attributes than random non-stereotypical ones, with offensiveness peaking for groups from Africa, South America, and Southeast Asia, and shows a pervasive “stereotypical pull” across most identity groups. These findings underscore the need for global, nuanced evaluation pipelines and responsible design choices in T2I systems, and ViSAGe provides a publicly released dataset and methodology to support ongoing safety interventions and bias mitigation.

Abstract

Recent studies have shown that Text-to-Image (T2I) model generations can reflect social stereotypes present in the real world. However, existing approaches for evaluating stereotypes have a noticeable lack of coverage of global identity groups and their associated stereotypes. To address this gap, we introduce the ViSAGe (Visual Stereotypes Around the Globe) dataset to enable the evaluation of known nationality-based stereotypes in T2I models, across 135 nationalities. We enrich an existing textual stereotype resource by distinguishing between stereotypical associations that are more likely to have visual depictions, such as 'sombrero', from those that are less visually concrete, such as 'attractive'. We demonstrate ViSAGe's utility through a multi-faceted evaluation of T2I generations. First, we show that stereotypical attributes in ViSAGe are thrice as likely to be present in generated images of corresponding identities as compared to other attributes, and that the offensiveness of these depictions is especially higher for identities from Africa, South America, and South East Asia. Second, we assess the stereotypical pull of visual depictions of identity groups, which reveals how the 'default' representations of all identity groups in ViSAGe have a pull towards stereotypical depictions, and that this pull is even more prominent for identity groups from the Global South. CONTENT WARNING: Some examples contain offensive stereotypes.

Paper Structure (28 sections, 2 equations, 8 figures, 5 tables)

Introduction
Related Work
Stereotypes in Text-to-Image Models
Stereotype Benchmarks in Textual Modality
Our Approach
Identifying Visual Stereotypes
Annotating Visual Attributes
Mapping to Visual Stereotypes
Detecting Visual Stereotypes in Text-to-Image Generation
Detecting Stereotypes through Human Annotations
Detecting Stereotypes through Automated Methods
Study 1: Stereotypical Depictions
Stereotypes Identified through Human Annotations
Do generated images reflect known stereotypes?
Are some identity groups depicted more stereotypically than others?
...and 13 more sections

Figures (8)

Figure 1: We identify 'visual' stereotypes in the generated images of the identity group by grounding the evaluations in existing textual stereotype benchmarks. Yellow boxes denote annotated visual markers of known stereotypes associated with the identity group in the image. We use Stable Diffusion (Rombach et al., 2022) to generate images and evaluate them using the stereotypes present in the SeeGULL dataset (Jha et al., 2023).
Figure 2: Global distribution of visual stereotypes across countries. Depth of the color indicates the number of visual stereotypes. A few examples of visual stereotypes of some countries are shown in the figure.
Figure 3: Our approach makes a distinction between "visual" and "non-visual" stereotypes in images. We identify only explicitly present visual stereotypes in the generated images of the identity group.
Figure 4: 'Stereotypical Pull': The generative models have a tendency to 'pull' the generation of images towards an already known stereotype even when prompted otherwise. The red lines indicate 'stereotypical' attributes; the blue lines indicate 'non-stereotypical' attributes. The numbers indicate the mean cosine similarity score between sets of image embeddings.
Figure 5: 'Stereotypical Pull' observed across different identity groups. Y-axis is the similarity $S(\cdot)$ between stereotyped (s) and non-stereotyped (ns) images, $S(s, ns)$. X-axis represents the difference in the deviations of the stereotypical (s) and the non-stereotypical (ns) images from the default (d) representations, $S(d, s) - S(d, ns)$.
...and 3 more figures
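
The two axes described in the Figure 5 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that the "mean cosine similarity score between sets of image embeddings" is an average over all cross-set pairs, and the function names (`mean_similarity`, `pull_coordinates`) and the toy 2-D vectors are hypothetical stand-ins for real image embeddings.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity(set_a, set_b):
    """S(a, b): mean cosine similarity over all cross-set pairs.

    Assumption: pairwise averaging; the paper's exact aggregation
    may differ.
    """
    scores = [cosine(u, v) for u, v in product(set_a, set_b)]
    return sum(scores) / len(scores)


def pull_coordinates(default, stereo, non_stereo):
    """Return (x, y) as defined in the Figure 5 caption.

    x = S(d, s) - S(d, ns)
    y = S(s, ns)
    """
    x = mean_similarity(default, stereo) - mean_similarity(default, non_stereo)
    y = mean_similarity(stereo, non_stereo)
    return x, y


# Toy 2-D vectors standing in for image embeddings (hypothetical values).
default = [[1.0, 0.0]]
stereo = [[1.0, 0.0]]
non_stereo = [[0.0, 1.0]]

x, y = pull_coordinates(default, stereo, non_stereo)
print(x, y)
```

Under these definitions, a positive x value means the default images sit closer to the stereotyped images than to the non-stereotyped ones, which is the direction the captions call a stereotypical pull.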
