Understanding the Interplay of Scale, Data, and Bias in Language Models: A Case Study with BERT

Muhammad Ali; Swetasudha Panda; Qinlan Shen; Michael Wick; Ari Kobren

TL;DR

This study examines how model scale interacts with pre-training data to shape social biases in BERT, both upstream (during pre-training) and downstream (after fine-tuning). Using four BERT sizes pre-trained on CC-100-EN, English Wikipedia, or a mix, the authors measure upstream bias with log-probability and sentiment metrics, measure downstream bias with false-positive-rate disparities on toxicity classification, and complement both with dataset-bias analyses. Upstream biases tend to increase with model size, but the kind of bias depends on the data: models pre-trained on CC-100-EN show higher toxicity, while models pre-trained on Wikipedia show stronger gender stereotypes. Downstream biases decrease with scale regardless of pre-training data, although certain identity groups consistently exhibit higher toxicity associations. The work highlights the critical role of pre-training data composition for bias outcomes, suggests that mixing moderated data with large-scale corpora can attenuate some biases, and underscores the need for careful, domain-specific bias metrics when evaluating language models. The per-group false positive rate $FPR_g$ and other bias measures should be interpreted within the broader socio-technical context to ensure responsible deployment.
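To make the downstream metric concrete, here is a minimal Python sketch of a per-group false positive rate. It assumes binary labels (0 = non-toxic, 1 = toxic) and one identity-group tag per example; the function name, the toy data, and the max-minus-min summary of disparity are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def false_positive_rates(labels, predictions, groups):
    """Return {group: FPR}, where FPR is the share of non-toxic
    examples (label 0) that the classifier flags as toxic (prediction 1)."""
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for y, y_hat, g in zip(labels, predictions, groups):
        if y == 0:  # only non-toxic examples can be false positives
            negatives[g] += 1
            if y_hat == 1:
                false_pos[g] += 1
    # Groups with no non-toxic examples have an undefined FPR and are skipped.
    return {g: false_pos[g] / n for g, n in negatives.items() if n > 0}


# Toy data (hypothetical): two identity groups, "a" and "b".
labels      = [0, 0, 0, 0, 1, 0, 0, 1]
predictions = [1, 0, 0, 0, 1, 1, 1, 1]
groups      = ["a", "a", "a", "a", "a", "b", "b", "b"]

fpr = false_positive_rates(labels, predictions, groups)
# One possible disparity summary (an assumption): the spread across groups.
fpr_gap = max(fpr.values()) - min(fpr.values())
print(fpr, fpr_gap)
```

A large spread means that harmless text mentioning some groups is flagged as toxic far more often than text mentioning others, which is the kind of disparity the downstream analysis looks for.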

Abstract

In the current landscape of language model research, larger models, larger datasets and more compute seem to be the only way to advance towards intelligence. While there have been extensive studies of scaling laws and models' scaling behaviors, the effect of scale on a model's social biases and stereotyping tendencies has received less attention. In this study, we explore the influence of model scale and pre-training data on a model's learned social biases. We focus on BERT -- an extremely popular language model -- and investigate biases as they show up during language modeling (upstream), as well as during classification applications after fine-tuning (downstream). Our experiments on four architecture sizes of BERT demonstrate that pre-training data substantially influences how upstream biases evolve with model scale. With increasing scale, models pre-trained on large internet scrapes like Common Crawl exhibit higher toxicity, whereas models pre-trained on moderated data sources like Wikipedia show greater gender stereotypes. However, downstream biases generally decrease with increasing model scale, irrespective of the pre-training data. Our results highlight the qualitative role of pre-training data in the biased behavior of language models, an often overlooked aspect in the study of scale. Through a detailed case study of BERT, we shed light on the complex interplay of data and model scale, and investigate how it translates to concrete biases.

Paper Structure (20 sections, 4 equations, 4 figures, 2 tables)

Introduction
How could scale influence bias?
Our contributions.
Related Work
Methods
Models
Pre-Training Data
Pre-training configuration.
Metrics
Upstream bias metrics.
Downstream bias metrics.
Dataset bias metrics.
Results
Upstream biases can increase with model size
Evolution of bias over the training process.
...and 5 more sections

Figures (4)

Figure 1: Upstream biases for each model size and pre-training data type in terms of our bias metrics: (a) log probability gaps between he/him and she/her pronouns for prompts related to occupations (b) average negative sentiment for masked language modeling completions related to multiple identity groups.
Figure 2: Upstream biases (measured via negative sentiment associations) over the course of pre-training. Models pre-trained on CC-100 generally result in higher bias scores compared to models pre-trained on Wikipedia.
Figure 3: Downstream biases evaluated on toxicity classification data from Dixon et al. (2018). For each model size and type of pre-training data, false positive rates (FPR) for each identity group are shown. The median FPR and the variance of FPRs decrease as models grow larger, but some outliers remain.
Figure 4: Average negative sentiment for sentences in pre-training data that mention our studied identity groups. CC-100-EN (x-axis) almost always encodes more negative sentiment.
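
The pronoun log-probability gap plotted in Figure 1(a) can be illustrated with a short Python sketch. The exact aggregation the authors use is not given here, so this sketch assumes that the probabilities of the pronouns in each set are summed at the masked position and that the gap is the difference of the resulting log probabilities; the function names and toy numbers are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def pronoun_log_prob_gap(token_log_probs,
                         male=("he", "him"),
                         female=("she", "her")):
    """Log P(male pronouns) minus log P(female pronouns) at a masked slot.

    token_log_probs maps candidate tokens to the log probabilities that a
    masked language model assigns them for one occupation prompt.
    A positive gap means the male pronouns are more likely.
    """
    male_lp = logsumexp([token_log_probs[t] for t in male])
    female_lp = logsumexp([token_log_probs[t] for t in female])
    return male_lp - female_lp


# Toy log probabilities (hypothetical) for a prompt with one masked pronoun.
toy = {
    "he": math.log(0.30),
    "him": math.log(0.10),
    "she": math.log(0.15),
    "her": math.log(0.05),
}
gap = pronoun_log_prob_gap(toy)
print(gap)
```

In practice the log probabilities would come from a masked language model's output distribution over its vocabulary, and the gap would be averaged over many occupation prompts.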
