Large Language Models Develop Novel Social Biases Through Adaptive Exploration

Addison J. Wu, Ryan Liu, Xuechunzi Bai, Thomas L. Griffiths

TL;DR

This work shows that large language models can develop novel social biases through exploration in sequential decision tasks, even when groups differ by no actual traits. Using a psychology-inspired hiring paradigm, the authors quantify emergent bias via three metrics—Stratification Index $\mathrm{SI}$, Between-Group Divergence $\mathrm{BGD}$, and Group Assignment Stochasticity $\mathrm{GASI}$—across a broad set of frontier models, finding that newer models stratify more than humans. Interventions spanning system prompts, problem structure, and explicit diversity objectives reveal that while CoT and environment tweaks have limited impact, prompting explicit diversity goals provides the most robust mitigation. The findings highlight that LLMs can shape and reinforce social disparities over time, underscoring the need for carefully designed, multifaceted objectives to guide AI systems toward fairer outcomes in decision-making contexts.

Abstract

As large language models (LLMs) are adopted into frameworks that grant them the capacity to make real decisions, it is increasingly important to ensure that they are unbiased. In this paper, we argue that the predominant approach of simply removing existing biases from models is not enough. Using a paradigm from the psychology literature, we demonstrate that LLMs can spontaneously develop novel social biases about artificial demographic groups even when no inherent differences exist. These biases result in highly stratified task allocations, which are less fair than assignments by human participants and are exacerbated by newer and larger models. In social science, emergent biases like these have been shown to result from exploration-exploitation trade-offs, where the decision-maker explores too little, allowing early observations to strongly influence impressions about entire demographic groups. To alleviate this effect, we examine a series of interventions targeting model inputs, problem structure, and explicit steering. We find that explicitly incentivizing exploration most robustly reduces stratification, highlighting the need for better multifaceted objectives to mitigate bias. These results reveal that LLMs are not merely passive mirrors of human social biases, but can actively create new ones from experience, raising urgent questions about how these systems will shape societies over time.

Paper Structure

This paper contains 59 sections, 1 theorem, 17 equations, 13 figures, 5 tables.

Key Result

Lemma 1

Let $G$ be a random variable for demographic group, $J$ for job class, and $R$ for the run of the experiment. Assume that [...]. Define the Stratification Index (SI) as [...], where $U_J$ is the uniform distribution on $\mathcal{J}$ and $H(\cdot)$ is the Shannon entropy (with log base 2). Then [...], i.e., SI equals the expected mutual information between $G$ and $J$ across runs. In particular, in a single run (when [...]
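
The equivalence claimed in Lemma 1 can be checked numerically. The sketch below is illustrative only: the exact SI formula is not preserved in this extract, so the code assumes a per-run form $H(U_J) - H(J \mid G)$, which is consistent with the symbols the lemma names but is not confirmed by the source. Under a uniform job marginal, $H(J) = H(U_J)$, so this quantity coincides with the mutual information $I(G;J) = H(J) - H(J \mid G)$. All function names and the count tables are hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def mutual_information_bits(counts):
    """I(G;J) in bits from a groups-by-jobs table of assignment counts."""
    total = sum(sum(row) for row in counts)
    p_group = [sum(row) / total for row in counts]
    p_job = [sum(col) / total for col in zip(*counts)]
    mi = 0.0
    for g, row in enumerate(counts):
        for j, c in enumerate(row):
            if c > 0:
                p_joint = c / total
                mi += p_joint * math.log2(p_joint / (p_group[g] * p_job[j]))
    return mi


def uniform_minus_conditional_bits(counts):
    """ASSUMED per-run SI form: H(U_J) - H(J | G), in bits."""
    total = sum(sum(row) for row in counts)
    n_jobs = len(counts[0])
    h_cond = 0.0
    for row in counts:
        n = sum(row)
        if n > 0:
            # Weight each group's job entropy by that group's share of hires.
            h_cond += (n / total) * entropy_bits([c / n for c in row])
    return math.log2(n_jobs) - h_cond


def expected_over_runs(runs, stat):
    """Average a per-run statistic over runs, treating runs as equally likely."""
    return sum(stat(counts) for counts in runs) / len(runs)


# Hypothetical runs with two groups (rows) and two job classes (columns).
fully_stratified = [[10, 0], [0, 10]]   # each group always gets one job
partly_stratified = [[8, 2], [2, 8]]    # uniform job marginal (10 vs 10)
skewed_marginal = [[8, 2], [8, 2]]      # job marginal is NOT uniform

runs = [fully_stratified, partly_stratified]
si = expected_over_runs(runs, uniform_minus_conditional_bits)
mi = expected_over_runs(runs, mutual_information_bits)
print(f"assumed SI = {si:.4f} bits, expected MI = {mi:.4f} bits")
```

In the first two tables the job marginal is uniform, so the two statistics agree. In `skewed_marginal` the groups are treated identically (mutual information is zero) yet the assumed SI form is positive, which suggests why the lemma needs the uniform-marginal condition named in its title.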

Figures (13)

  • Figure 1: An illustration of the sequential hiring paradigm (Bai et al., 2025) that we adapt to test LLMs.
  • Figure 2: Frontier models (dots and squares) stratify by demographic more than human participants (dashed lines) across SI and BGD in the hiring paradigm. CoT marginally reduces this stratification.
  • Figure 3: Across model families, stratification increases with newer and larger models.
  • Figure 4: LLMs that are fairer according to the BBQ benchmark (Parrish et al., 2022) are instead more susceptible to emergent biases, and make decisions that lead to worse stratification.
  • Figure 5: Lowering underlying success probabilities reduced stratification, especially with CoT, but this was not equally effective across models. Using realistic probabilities weakened this effect.
  • ...and 8 more figures

Theorems & Definitions (2)

  • Lemma 1: Equivalence of SI and MI under uniform job category marginals
  • proof