Into the LAIONs Den: Investigating Hate in Multimodal Datasets

Abeba Birhane; Vinay Prabhu; Sang Han; Vishnu Naresh Boddeti; Alexandra Sasha Luccioni

TL;DR

This work challenges the assumption that increasing dataset size inherently improves safety by showing that hate-related content in alt-text rises when scaling from LAION-400M to LAION-2B-en. Using pysentimiento to score hateful, targeted, and aggressive speech, the authors define the Hate Content Rate (HCR) and report a consistent, measurable increase in harmful text accompanying images as the data scale expands. They also find only a partial, weak correlation between image NSFW signals and harmful alt-text, and show that filtering on NSFW values alone fails to remove all toxic content, underscoring the need for robust, multimodal data curation and independent audits. The paper argues for transparent metrics, open data access, and governance to ensure safer deployment of multimodal models trained on large web-sourced datasets.

Abstract

'Scale the model, scale the data, scale the compute' is the reigning sentiment in the world of generative AI today. While the impact of model scaling has been extensively studied, we are only beginning to scratch the surface of data scaling and its consequences. This is of critical importance in the context of vision-language datasets such as LAION. These datasets are continually growing in size and are built from large-scale internet dumps such as the Common Crawl, which is known to have numerous drawbacks in terms of quality, legality, and content. The datasets then serve as the backbone for large generative models, contributing to the operationalization and perpetuation of harmful societal and historical biases and stereotypes. In this paper, we investigate the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B. Our results show that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively using a metric that we term Hate Content Rate (HCR). We also found that filtering dataset contents based on Not Safe For Work (NSFW) values calculated from images alone does not exclude all the harmful content in alt-text: trace amounts of hateful, targeted, and aggressive text remain even under conservative filtering. We end with a reflection and a discussion of the significance of our results for dataset curation and usage in the AI community. Code and the meta-data assets curated in this paper are publicly available at https://github.com/vinayprabhu/hate_scaling. Content warning: This paper contains examples of hateful text that might be disturbing, distressing, and/or offensive.
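To make the HCR metric concrete, here is a minimal sketch under stated assumptions. It assumes HCR is the percentage of alt-text samples whose classifier score for a category exceeds a chosen threshold; the exact definition is given in the full paper, not in this summary. The function name `hate_content_rate`, the threshold value, and the toy scores are all hypothetical. The scores stand in for per-sample probabilities for the three categories named in the abstract and are not produced by pysentimiento here.

```python
from typing import Dict, List

# The three categories named in the abstract.
CATEGORIES = ("hateful", "targeted", "aggressive")


def hate_content_rate(
    scores: List[Dict[str, float]], threshold: float
) -> Dict[str, float]:
    """Return, per category, the percentage of samples scoring above `threshold`.

    Assumption: this thresholded-percentage form is an illustrative reading
    of HCR, not a verbatim copy of the paper's formula.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    n = len(scores)
    return {
        cat: 100.0 * sum(1 for s in scores if s[cat] > threshold) / n
        for cat in CATEGORIES
    }


# Made-up scores for four alt-text samples (not real data).
toy_scores = [
    {"hateful": 0.02, "targeted": 0.01, "aggressive": 0.01},
    {"hateful": 0.91, "targeted": 0.75, "aggressive": 0.10},
    {"hateful": 0.60, "targeted": 0.05, "aggressive": 0.55},
    {"hateful": 0.10, "targeted": 0.02, "aggressive": 0.03},
]

rates = hate_content_rate(toy_scores, threshold=0.5)
print(rates)
```

Sweeping `threshold` over a range of values and plotting the resulting rates would give one curve per category, which is one plausible way to read the "HCR curves" mentioned in the Figure 1 caption.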

Paper Structure (15 sections, 3 equations, 5 figures, 2 tables)

Introduction
Scaling Datasets: An Overview
Dataset Audit: LAION-400M and LAION-2B-en
Audit methodology
Experiment design
Scaling is not benign: comparing LAION 400M and LAION 2B-en
Intra-dataset filewise comparisons
Connecting toxic alt-text and NSFW labels
Discussion and Recommendations
Conclusion
The risks of extrapolation
NSFW Analysis
The origins of the dataset scaling laws: A cartoon sketch emerges
Blackbox non-reproducible empirical results
The tactical template: Fuzzy main section meets non-existent appendices

Figures (5)

Figure 1: HCR curves for the LAION-400M and LAION-2B-en datasets, computed from pysentimiento outputs, showing that Hate Content Rate increased with dataset size.
Figure 2: Fused swarm-box-violin plot of the file-wise HCR metrics for all 160 (= 32 + 128) parquet files from LAION-400M and LAION-2B-en. The file-level HCRs for LAION-2B-en (red swarms) are higher than the 32 file-level HCRs for LAION-400M (blue swarms) in all three sub-categories: hateful, targeted, and aggressive speech.
Figure 3: Results from the two-sample t-test corrected for unequal variances (Welch's separate-variances t-test). 'BF10' indicates the Bayes Factor of the alternative hypothesis. For all three categories of hateful, targeted, and aggressive speech, the file-wise HCR of the 2B-en dataset is higher than the file-wise HCR of the 400M dataset, showing dataset degradation with dataset scaling.
Figure A1: Binomial proportion confidence interval (CI) analysis to establish the extent of HCR underestimation when using LAION-400M statistics.
Figure A2: The Google template used to (non)declare the training dataset information, along with paper screenshots.
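
The file-wise comparison described in the Figure 3 caption relies on Welch's t-test, which compares two sample means without assuming equal variances. The sketch below computes only the Welch t statistic and the Welch-Satterthwaite degrees of freedom using the Python standard library. The function name `welch_t` and the HCR values are hypothetical and made up for illustration; they are not the paper's data. The Bayes Factor reported in the figure is not computed here.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for a vs. b."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    if se2_a + se2_b == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Hypothetical file-wise HCR values (in percent); not real measurements.
hcr_2b_en = [0.9, 1.1, 1.0, 1.2]
hcr_400m = [0.5, 0.7, 0.6]

t_stat, dof = welch_t(hcr_2b_en, hcr_400m)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

A positive t statistic here means the first sample has the higher mean. In practice, `scipy.stats.ttest_ind` with `equal_var=False` performs the same test and also returns a p-value.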
