Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI
Robert Wolfe, Aayushi Dangol, Alexis Hiniker, Bill Howe
TL;DR
This paper investigates whether vision–language models, specifically 43 pretrained CLIP variants, learn human-like facial impression biases, and how dataset scale and societal consensus modulate those biases. Using the One Million Impressions (OMI) dataset as human ground truth and a suite of prompt-based embeddings, the authors quantify the model–human bias similarity $s^a_m$ via Spearman correlation. They show that model biases track human inter-rater reliability (IRR) and grow with data scale, particularly for visually unobservable traits. They extend the analysis to generative models by projecting CLIP-derived bias subspaces into text-to-image outputs, revealing that SDXL propagates similar biases and exhibits White–Black attribute differences. The findings indicate that large, uncurated training data contribute to emergent, societally biased representations, underscoring the need for careful dataset curation in zero-shot deployments and offering a computational social science lens on bias. The work also demonstrates how synthetic data tools and subspace methods can be used to audit and study bias across multimodal AI systems.
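The similarity measure mentioned above can be illustrated with a minimal sketch. It assumes that, for one attribute and one model, each face image has a mean human rating and a scalar model association score (for example, a cosine similarity between the image embedding and an attribute prompt); how the paper derives the model scores is not specified in this summary, so that part is an assumption. The function names and the example values are hypothetical and purely illustrative.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values for illustration only (not data from the paper):
# one entry per face image, for a single attribute and a single model.
human_ratings = [3.1, 4.7, 2.2, 5.0, 3.8]
model_scores = [0.21, 0.25, 0.18, 0.24, 0.27]

similarity = spearman(human_ratings, model_scores)
print(f"model-human similarity for this attribute: {similarity:.2f}")
```

Because Spearman correlation depends only on rank order, the human ratings and the model scores do not need to share a scale; only whether the model orders the faces the way human raters do matters.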
Abstract
Multimodal AI models capable of associating images and text hold promise for numerous domains, ranging from automated image captioning to accessibility applications for blind and low-vision users. However, uncertainty about bias has in some cases limited their adoption and availability. In the present work, we study 43 CLIP vision-language models to determine whether they learn human-like facial impression biases, and we find evidence that such biases are reflected across three distinct CLIP model families. We show for the first time that the degree to which a bias is shared across a society predicts the degree to which it is reflected in a CLIP model. Human-like impressions of visually unobservable attributes, like trustworthiness and sexuality, emerge only in models trained on the largest dataset, indicating that a better fit to uncurated cultural data results in the reproduction of increasingly subtle social biases. Moreover, we use a hierarchical clustering approach to show that dataset size predicts the extent to which the underlying structure of facial impression bias resembles that of facial impression bias in humans. Finally, we show that Stable Diffusion models employing CLIP as a text encoder learn facial impression biases, and that these biases intersect with racial biases in Stable Diffusion XL-Turbo. While pretrained CLIP models may prove useful for scientific studies of bias, they will also require significant dataset curation when intended for use as general-purpose models in a zero-shot setting.
