The Dark Side of Dataset Scaling: Evaluating Racial Classification in Multimodal Models

Abeba Birhane; Sepehr Dehdashtian; Vinay Uday Prabhu; Vishnu Boddeti

TL;DR

This work audits 14 OpenCLIP Vision Transformer-based visio-linguistic models trained on LAION-400M and LAION-2B-en, using the Chicago Face Dataset (CFD) as a probe for racial and gender bias as the pre-training dataset is scaled. When the dataset grows from 400M to 2B samples, the larger models (e.g., ViT-L-14) become more likely to label the faces of Black and Latino men as 'criminal', while the smaller ViT-B models show the opposite trend; misclassification into offensive non-human classes becomes less likely. The study discusses ethical implications, data-curation practices, and mitigation strategies, arguing for open access and rigorous audits to prevent the amplification of harmful societal stereotypes in large-scale multimodal systems. It also outlines future directions, including extending the analysis to other models, refining prompts, and addressing data-leakage and bias-related risks in dataset curation and deployment.

Abstract

'Scale the model, scale the data, scale the GPU farms' is the reigning sentiment in the world of generative AI today. While model scaling has been extensively studied, data scaling and its downstream impacts on model performance remain under-explored. This is particularly important in the context of multimodal datasets whose main source is the World Wide Web, condensed and packaged as the Common Crawl dump, which is known to exhibit numerous drawbacks. In this paper, we evaluate the downstream impact of dataset scaling on 14 visio-linguistic models (VLMs) trained on the LAION-400M and LAION-2B datasets by measuring racial and gender bias using the Chicago Face Dataset (CFD) as the probe. Our results show that as the training data increased, the probability of a pre-trained CLIP model misclassifying human images as offensive non-human classes such as chimpanzee, gorilla, and orangutan decreased, but the probability of misclassifying the same images as human offensive classes such as criminal increased. Furthermore, of the 14 Vision Transformer-based VLMs we evaluated, the probability of predicting an image of a Black man and a Latino man as criminal increases by 65% and 69%, respectively, when the dataset is scaled from 400M to 2B samples for the larger ViT-L models. Conversely, for the smaller base ViT-B models, the probability of predicting an image of a Black man and a Latino man as criminal decreases by 20% and 47%, respectively, when the dataset is scaled from 400M to 2B samples. We ground the model audit results in a qualitative and historical analysis, reflect on our findings and their implications for dataset curation practice, and close with a summary of mitigation mechanisms and ways forward. Content warning: This article contains racially dehumanising and offensive descriptions.
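
The audit described in the abstract assigns each probe image a probability over a fixed set of class labels. A minimal sketch of that kind of zero-shot scoring follows, assuming the usual CLIP recipe of a softmax over scaled cosine similarities between one image embedding and one text embedding per class. The embeddings are toy vectors, the labels are neutral placeholders standing in for the audit classes, and `logit_scale=100.0` is an assumed value; none of these are taken from the paper.

```python
import math

# Placeholder labels (hypothetical); in an audit these would be the class prompts.
LABELS = ["label_a", "label_b", "label_c", "label_d"]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """One row of a softmax matrix: class probabilities for a single image.

    logit_scale is an assumed temperature, not a value from the paper.
    """
    sims = [cosine_similarity(image_emb, t) for t in text_embs]
    return softmax([logit_scale * s for s in sims])

# Toy embeddings for illustration only (not real model outputs).
image_emb = [0.9, 0.1, 0.0]
text_embs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]

probs = zero_shot_probs(image_emb, text_embs)
predicted = LABELS[probs.index(max(probs))]
print(predicted, [round(p, 4) for p in probs])
```

Stacking one such row per probe image gives a matrix with one row per image and one column per class, which is the shape of the softmax matrices in the Figure 1 caption.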

Paper Structure (16 sections, 2 equations, 9 figures, 1 table)

Introduction
Audit Methodology
Experiment Design
Results
Qualitative Analysis: Dehumanization and Criminalization of Black Bodies
Discussion and Recommendations
Future Work and Conclusion
Conclusion
Effect of cosine similarity metric during dataset curation
CLIP-like models suffering from Concept Association Bias (CAB)
Bags-Of-Words like behavior
Data-leakage and Identity Inference Attacks (IDIA)
Additional methodological details
Self-similarity matrix of CFD extracted features
Randomly selected, hand-blurred samples from the CFD
...and 1 more section

Figures (9)

Figure 1: Heatmaps of 597 × 8 softmax-matrices for three models (columns) and two pre-training datasets (rows).
Figure 2: Effect of scaling the dataset from 400M to 2B on the frequency of an image from CFD getting predicted as 'criminal' for each race-gender group and three different architectures: ViT-B-16 (a), ViT-B-32 (b), and ViT-L-14 (c). We observe that the larger ViT-L model's predilection for labeling faces as 'criminal' increases significantly for Black and Latino men when the pre-training dataset is scaled from 400M to 2B (see the CFD results section, specifically 3.1 to 3.3).
Figure 3: Plots showing the effect of patch size on the distribution of "criminal" predictions for (a) LAION-400M and (b) LAION-2B as the pre-training datasets.
Figure 4: Frequency of 'criminal' prediction versus patch size for the LAION-400M (a) and for the LAION-2B (b) datasets.
Figure 5: The effect of dataset scaling on the predictions of the models. The numbers show the change in probabilities when the pre-training dataset is scaled from 400M to 2B for ViT-B-16, ViT-B-32, and ViT-L-14. Positive values mean an increase in the probability when the number of pre-training samples is increased. All values are percentages.
...and 4 more figures
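
The quantities named in the captions of Figures 2 and 5 (the per-group frequency of a 'criminal' prediction, and its change when the pre-training dataset is scaled) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the function names and the group identifier `group_1` are hypothetical, the predictions are made-up toy data, and the change is computed as a relative percentage, which is only one possible reading of the captions.

```python
from collections import defaultdict

def criminal_frequency(predictions, target="criminal"):
    """Fraction of images per group whose top prediction equals `target`.

    predictions: iterable of (group, predicted_label) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, label in predictions:
        totals[group] += 1
        if label == target:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

def relative_change_percent(before, after):
    """Relative change from `before` to `after`, in percent (assumed definition)."""
    if before == 0:
        raise ValueError("relative change is undefined when the baseline is zero")
    return 100.0 * (after - before) / before

# Toy top-1 predictions for one hypothetical group under two pre-training datasets.
preds_400m = [("group_1", "criminal"), ("group_1", "other"),
              ("group_1", "other"), ("group_1", "other")]
preds_2b = [("group_1", "criminal"), ("group_1", "criminal"),
            ("group_1", "other"), ("group_1", "other")]

freq_400m = criminal_frequency(preds_400m)
freq_2b = criminal_frequency(preds_2b)
change = relative_change_percent(freq_400m["group_1"], freq_2b["group_1"])
print(freq_400m["group_1"], freq_2b["group_1"], change)
```

A positive result corresponds to the caption's "increase in the probability when the number of pre-training samples is increased"; under an absolute-difference reading, the last step would instead be `100.0 * (after - before)`.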
