Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI

Robert Wolfe; Aayushi Dangol; Alexis Hiniker; Bill Howe

TL;DR

This paper investigates whether vision–language models, specifically 43 pretrained CLIP variants, learn human-like facial impression biases, and how dataset scale and societal consensus modulate those biases. Using the One Million Impressions (OMI) dataset as human ground truth and a suite of prompt-based embeddings, the authors quantify the model–human bias similarity $s^a_m$ via Spearman correlation. They show that this similarity tracks human inter-rater reliability (IRR) and grows with data scale, particularly for visually unobservable traits. They extend the analysis to generative models by projecting CLIP-derived bias subspaces into text-to-image outputs, finding that SDXL propagates similar biases and exhibits White–Black attribute differences. The findings indicate that large, uncurated training data contribute to emergent, societally biased representations, which underscores the need for careful dataset curation in zero-shot deployments and offers a computational social science lens on bias. The work also demonstrates how synthetic data tools and subspace methods can be used to audit and study bias across multimodal AI systems.
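The similarity measure mentioned above can be illustrated with a minimal sketch. It assumes that $s^a_m$ denotes the Spearman correlation, for one model $m$ and one attribute $a$, between per-image model scores and mean human ratings, and that a model score is the cosine similarity between an image embedding and a single attribute prompt embedding. The function names, the single-prompt scoring, and the random arrays standing in for CLIP embeddings and OMI ratings are all hypothetical and are not taken from the paper.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j to cover the whole run of tied values starting at i.
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


def attribute_scores(image_emb, prompt_emb):
    """Cosine similarity of each image embedding to one prompt embedding.

    Assumption: a single prompt per attribute; the paper's prompt suite
    may combine several prompts differently.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    prompt_emb = prompt_emb / np.linalg.norm(prompt_emb)
    return image_emb @ prompt_emb


def model_human_similarity(image_emb, prompt_emb, human_ratings):
    """Hypothetical s^a_m for one model and one attribute."""
    return spearman_rho(attribute_scores(image_emb, prompt_emb), human_ratings)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(50, 16))   # stand-in for CLIP image embeddings
    prompt_emb = rng.normal(size=16)        # stand-in for one attribute prompt
    # Synthetic "human ratings": model scores plus noise, for illustration only.
    human = attribute_scores(image_emb, prompt_emb) + rng.normal(scale=0.1, size=50)
    print(model_human_similarity(image_emb, prompt_emb, human))
```

Because the measure is rank-based, any monotone rescaling of the human ratings (for example, a different Likert range) leaves the value unchanged, which is one reason a rank correlation is a natural choice for comparing model scores with human judgments.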

Abstract

Multimodal AI models capable of associating images and text hold promise for numerous domains, ranging from automated image captioning to accessibility applications for blind and low-vision users. However, uncertainty about bias has in some cases limited their adoption and availability. In the present work, we study 43 CLIP vision-language models to determine whether they learn human-like facial impression biases, and we find evidence that such biases are reflected across three distinct CLIP model families. We show for the first time that the degree to which a bias is shared across a society predicts the degree to which it is reflected in a CLIP model. Human-like impressions of visually unobservable attributes, like trustworthiness and sexuality, emerge only in models trained on the largest dataset, indicating that a better fit to uncurated cultural data results in the reproduction of increasingly subtle social biases. Moreover, we use a hierarchical clustering approach to show that dataset size predicts the extent to which the underlying structure of facial impression bias resembles that of facial impression bias in humans. Finally, we show that Stable Diffusion models employing CLIP as a text encoder learn facial impression biases, and that these biases intersect with racial biases in Stable Diffusion XL-Turbo. While pretrained CLIP models may prove useful for scientific studies of bias, they will also require significant dataset curation when intended for use as general-purpose models in a zero-shot setting.

Paper Structure (48 sections, 10 equations, 9 figures, 4 tables)

Introduction
Related Work
Facial Impression Bias
Relationship to Social Group Biases
Computational Models of Facial Impression Bias
CLIP and Vision-Language AI
Text-to-Image Generators
Impact of Scale in Deep Learning and in CLIP
Bias in Vision-Language AI
Synthetic Media for Vision-Language Research
Data
The One Million Impressions Dataset
CLIP Training Data
Pretrained CLIP Models
Pretrained Stable Diffusion Models
...and 33 more sections

Figures (9)

Figure 1: CLIP models learn human-like facial impression biases. The highest model-human correlations are obtained for intuitively visual categories that are broadly shared by a society (such as gender, age, and happiness). Models trained on the largest dataset (LAION-2B) exhibit more human-like biases than FaceCLIP or OpenAI models for most attributes.
Figure 2: Examples from the OMI dataset repository at https://github.com/jcpeterson/omi, used as stimuli in our research.
Figure 3: The similarity of CLIP bias to human bias is strongly correlated with human IRR, indicating that the societal consistency of a bias plays a significant role in whether a model learns it during semi-supervised pretraining.
Figure 4: CLIP models exhibit significant Spearman's $\rho$ between Mean Model-Human Similarity and OMI IRR.
Figure 5: The structure of facial impression biases in CLIP-ViT-L-14 mirrors that of human facial impression biases quantified in the OMI dataset. Clusters related to ethnicity emerge in each, as do clusters grouping gender, sexuality, and smugness.
...and 4 more figures
