Inference-Time Reasoning Selectively Reduces Implicit Social Bias in Large Language Models

Molly Apsel, Michael N. Jones

TL;DR

This work investigates whether enabling inference-time reasoning in large language models modulates implicit social bias. Across two experiments, the authors reuse an IAT-inspired Word Association Test to assess implicit biases, comparing standard inference against reasoning-enabled inference in multiple model families. Experiment 1 finds that reasoning reduces implicit bias for several social topics, with a significant aggregate effect ($\Delta = 0.189$, $p < 0.0001$) but substantial heterogeneity across models and topics; Experiment 2 shows no such reduction for non-social semantic prosody stimuli ($p = 0.31$), suggesting domain-specific effects. The results imply that reasoning-enabled inference can influence fairness evaluation outcomes for certain models and biases, while underscoring the need to interpret bias measures carefully and to explore how alignment interacts with reasoning across architectures.

Abstract

Drawing on constructs from psychology, prior work has identified a distinction between explicit and implicit bias in large language models (LLMs). While many LLMs undergo post-training alignment and safety procedures to avoid expressions of explicit social bias, they still exhibit significant implicit biases on indirect tasks resembling the Implicit Association Test (IAT). Recent work has further shown that inference-time reasoning can impair LLM performance on tasks that rely on implicit statistical learning. Motivated by a theoretical link between implicit associations and statistical learning in human cognition, we examine how reasoning-enabled inference affects implicit bias in LLMs. We find that enabling reasoning significantly reduces measured implicit bias on an IAT-style evaluation for some model classes across fifteen stereotype topics. This effect appears specific to social bias domains, as we observe no corresponding reduction for non-social implicit associations. As reasoning is increasingly enabled by default in deployed LLMs, these findings suggest that it can meaningfully alter fairness evaluation outcomes in some systems, while also raising questions about how alignment procedures interact with inference-time reasoning to drive variation in bias reduction across model types. More broadly, this work highlights how theory from cognitive science and psychology can complement AI evaluation research by providing methodological and interpretive frameworks that reveal new insights into model behavior.

Paper Structure (30 sections, 2 figures, 5 tables)

Figures (2)

  • Figure 1: LLM Word Association Test scores measuring implicit bias in models with standard and reasoning-enabled inference. Vertical lines illustrate the magnitude of the difference between conditions, and stars indicate a significant difference based on independent-samples t-tests ($p<.05$). The stereotypes tested, along the x-axis, fall into four social domains (race, gender, religion, health) and are color-coded accordingly. The bias scores, along the y-axis, range from -1 to 1, with greater positive values indicating greater stereotypical bias, 0 indicating unbiased responses, and negative values indicating counterstereotypical bias. The results show that reasoning-enabled inference strongly reduces bias scores for some models and topics, while other models show little or no change.
  • Figure 2: LLM Word Association Test scores for non-social semantic prosody stimuli under standard and reasoning-enabled inference. Unlike social bias domains (Figure 1), no model type shows a reliable difference between conditions.
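
The per-topic comparison described in the Figure 1 caption (bias scores in [-1, 1] under two inference conditions, compared with an independent-samples t-test) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name `pooled_t_statistic`, the toy scores, the choice of the pooled-variance (Student) form of the test, and the sign convention of the difference (standard minus reasoning) are not taken from the paper.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(standard, reasoning):
    """Return (mean difference, t statistic, degrees of freedom).

    Assumes the pooled-variance (Student) independent-samples t-test;
    the source does not say which variant the authors used.
    """
    n1, n2 = len(standard), len(reasoning)
    # Pooled sample variance across the two conditions.
    pooled_var = (
        (n1 - 1) * variance(standard) + (n2 - 1) * variance(reasoning)
    ) / (n1 + n2 - 2)
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    # Positive difference means lower measured bias with reasoning enabled.
    delta = mean(standard) - mean(reasoning)
    return delta, delta / standard_error, n1 + n2 - 2


# Hypothetical bias scores for one stereotype topic (not data from the paper).
standard_scores = [0.6, 0.8, 0.7, 0.9]
reasoning_scores = [0.2, 0.4, 0.3, 0.5]

delta, t_stat, dof = pooled_t_statistic(standard_scores, reasoning_scores)
print(f"difference = {delta:.3f}, t = {t_stat:.3f}, dof = {dof}")
```

The sketch stops at the t statistic. Obtaining a p-value to compare against the caption's $p<.05$ threshold requires the t distribution's CDF, for which a statistics library such as SciPy (`scipy.stats.ttest_ind`) would typically be used.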