Measuring Implicit Bias in Explicitly Unbiased Large Language Models

Xuechunzi Bai; Angelina Wang; Ilia Sucholutsky; Thomas L. Griffiths

TL;DR

This work shows that large language models that pass explicit bias tests can still harbor substantial implicit stereotypes. The authors adapt the Implicit Association Test into a prompt-based measure, LLM Implicit Bias, and pair it with LLM Decision Bias, which uses relative judgments between two candidates to detect discrimination in decisions. Together, the measures reveal pervasive biases in 8 value-aligned models across 4 social categories and 21 stereotypes. The prompt-based measure correlates with embedding-based bias measures yet better predicts discriminatory decisions, making it a useful diagnostic for proprietary LLMs whose embeddings are inaccessible. The findings underscore the value of psychology-inspired metrics for safety, governance, and bias mitigation in real-world AI systems.

Abstract

Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: as LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both challenges by introducing two new measures of bias: LLM Implicit Bias, a prompt-based method for revealing implicit bias; and LLM Decision Bias, a strategy to detect subtle discrimination in decision-making tasks. Both measures are based on psychological research: LLM Implicit Bias adapts the Implicit Association Test, widely used to study the automatic associations between concepts held in human minds; and LLM Decision Bias operationalizes psychological results indicating that relative evaluations between two candidates, not absolute evaluations assessing each independently, are more diagnostic of implicit biases. Using these measures, we found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 social categories (race, gender, religion, health) in 21 stereotypes (such as race and criminality, race and weapons, gender and science, age and negativity). Our prompt-based LLM Implicit Bias measure correlates with existing language model embedding-based bias methods, but better predicts downstream behaviors measured by LLM Decision Bias. These new prompt-based measures draw from psychology's long history of research into measuring stereotype biases based on purely observable behavior; they expose nuanced biases in proprietary value-aligned LLMs that appear unbiased according to standard benchmarks.
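The two measures described in the abstract can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not the authors' implementation: the function names, the two-group setup, and the scoring rules (an association score from -1 to 1 with 0 as unbiased, and a decision rate with 0.5 as unbiased) are hypothetical choices made here, and the placeholder labels stand in for real group names and attribute words.

```python
from typing import List, Set, Tuple


def association_score(
    pairings: List[Tuple[str, str]],
    group_a: str,
    attrs_a: Set[str],
    attrs_b: Set[str],
) -> float:
    """Hypothetical IAT-style score from a model's word-to-group pairings.

    pairings: (attribute_word, group_label) pairs chosen by the model.
    attrs_a / attrs_b: attribute words stereotypically tied to group A / B.
    Assumes exactly two groups, so any label other than group_a counts as B.
    Returns a value in [-1, 1]; 0 means no systematic association.
    """
    a_total = a_with_a = b_total = b_with_b = 0
    for word, group in pairings:
        if word in attrs_a:
            a_total += 1
            a_with_a += group == group_a
        elif word in attrs_b:
            b_total += 1
            b_with_b += group != group_a
    if a_total == 0 or b_total == 0:
        raise ValueError("need at least one pairing for each attribute set")
    # Share of stereotype-consistent pairings on each side, centered at 0.
    return a_with_a / a_total + b_with_b / b_total - 1.0


def decision_bias_rate(stereotype_consistent: List[bool]) -> float:
    """Hypothetical decision measure: share of relative (A-vs-B) decisions
    that follow the stereotype. 0.5 means no systematic preference."""
    if not stereotype_consistent:
        raise ValueError("need at least one decision")
    return sum(stereotype_consistent) / len(stereotype_consistent)


# Placeholder data: every attribute is paired in the stereotype-consistent way.
example_pairings = [("a1", "A"), ("a2", "A"), ("b1", "B"), ("b2", "B")]
example_score = association_score(example_pairings, "A", {"a1", "a2"}, {"b1", "b2"})
```

Both functions depend only on the model's observable outputs, which matches the abstract's motivation for prompt-based measures when embeddings are unavailable.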

Paper Structure (23 sections, 1 equation, 10 figures, 16 tables)

This paper contains 23 sections, 1 equation, 10 figures, 16 tables.

Introduction
Method
LLM Implicit Bias
LLM Decision Bias
Results
Uncovering LLM Implicit Bias
Uncovering LLM Decision Bias
Understanding Properties of LLM Implicit Bias
Related Work and Limitations
Conclusions
GPT-4 on Existing Bias Benchmarks
GPT-4 Moderation on self-generated Implicit and Decision Responses
Results for LLM Implicit Bias
Results for LLM Decision Bias
Prompts for LLM Implicit Bias
...and 8 more sections

Figures (10)

Figure 1: Example of implicit bias and decision bias in explicitly unbiased LLMs.
Figure 2: LLM Implicit Bias: Results showing LLM Implicit Bias scores on the vertical axis, for 21 stereotypes on the horizontal axis, in 4 social categories coded in 4 colors, across 8 LLMs in 8 panels. Areas shaded in gray indicate high levels of stereotypical bias, as shown in the majority of test cases. Red dotted horizontal lines indicate unbiased responses. Error bars represent 95% bootstrapped confidence intervals. See statistical analyses in the main text and tables in Appendix.
Figure 3: LLM Decision Bias: Results showing LLM Decision Bias scores on the vertical axis, for 21 stereotypes on the horizontal axis, in 4 social categories coded in 4 colors, across 8 LLMs in 8 panels. Areas shaded in gray indicate high levels of stereotypical bias, as shown in the majority of test cases. Red dotted horizontal lines indicate unbiased responses. Error bars represent 95% bootstrapped confidence intervals. See statistical analyses in the main text and tables in Appendix.
Figure 4: Scaling Analysis: Results showing LLM Implicit Bias (left), Decision Bias (middle), and Rejection Rate (right), with models sorted by approximately increasing size. Implicit bias increases with model size, whereas decision bias and rejection rate do not. Details in the main text.
Figure 5: LLM Implicit Bias vs. Embedding Bias Predicting Decision Bias: The top panels show how prompt-based LLM Implicit Bias predicts the binary decisions, whereas the bottom panels show how embedding bias predicts these decisions, for each social domain. The model fit is shown in the foreground with 95% confidence interval, and the raw data are plotted in the background.
...and 5 more figures
