Biases in the Blind Spot: Detecting What LLMs Fail to Mention

Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu

TL;DR

This work tackles unverbalized biases in LLM decision-making: factors that influence a model's decision but are not cited in its chain-of-thought reasoning. It introduces a fully automated, black-box pipeline that hypothesizes bias concepts with LLM autoraters, tests them via controlled input variations, and uses McNemar's test with Bonferroni correction and early stopping to detect statistically significant, unverbalized biases. The method is evaluated on three decision tasks (hiring, loan approval, university admissions) across six models, uncovering both known biases (gender, race) and novel ones (language proficiency, formality) while providing verifiable verbalization estimates. In cross-task analyses and ablations, the pipeline demonstrates robustness, generalizability to prior bias studies, and meaningful practical impact, offering a scalable path to automatic task-specific bias discovery beyond traditional CoT-based monitoring.

Abstract

Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these *unverbalized biases*. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's CoTs. We evaluate our pipeline across six LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic task-specific bias discovery.
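
The statistical step described above (paired decisions on positive and negative variations, tested with McNemar's test under a Bonferroni correction) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the exact two-sided (binomial) form of McNemar's test and a family-wise level of `alpha = 0.05`, neither of which is stated here, and it omits early stopping. The function names, concept names, and counts are hypothetical.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: pairs where only the positive variation was accepted.
    c: pairs where only the negative variation was accepted.
    Pairs with identical decisions carry no information and are ignored.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under H0, each discordant pair falls on either side with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def bonferroni_significant(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Compare each p-value against alpha divided by the number of tested concepts."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical discordant counts for three candidate concepts (illustrative only).
p_values = {
    "concept_a": mcnemar_exact_p(0, 10),
    "concept_b": mcnemar_exact_p(5, 5),
    "concept_c": mcnemar_exact_p(3, 9),
}
flags = bonferroni_significant(p_values)
```

Only the discordant pairs enter the test, which is why paired variations of the same input are needed: inputs whose decision does not change between the two variations contribute nothing to the statistic.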

Paper Structure (152 sections, 4 figures, 21 tables, 1 algorithm)

Figures (4)

  • Figure 1: Example of an unverbalized bias detected in Claude Sonnet 4 on the loan approval task. Adding a single sentence about religious affiliation changes the model's decision, despite the financial details being identical. In this example, the model never cites religion as a factor in its reasoning, instead constructing different framings of the same debt-to-income ratio: as an insurmountable concern (left) versus as outweighed by creditworthiness (right). This bias has an effect size of $0.037$ in favor of minority-religion applicants ($p = 9.15 \times 10^{-7}$ over $2{,}500$ inputs), meaning they are approved $3.7$ percentage points more often. The concept was verbalized in only $\mathbf{12.4\%}$ of responses where the decision flipped, well below our $30\%$ threshold.
  • Figure 2: Consistency of concept detection across five random seeds on the loan approval dataset with Gemma-3-12b-it.
  • Figure 3: Intentional bias injection study results on Gemma-3-27b-it. Secret biases should be detected as unverbalized; overt biases should be filtered via verbalization detection.
  • Figure 4: Per-concept accuracy in the bias ablation study. Each concept is tested in $4$ configurations. Green indicates $\geq 75\%$ accuracy, orange indicates $50$-$75\%$, and red indicates $< 50\%$.

Theorems & Definitions (1)

  • Definition 2.1: Unverbalized Bias
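
The formal statement of Definition 2.1 is not reproduced in this summary, but the criterion given in the abstract and in the Figure 1 caption (a statistically significant effect, combined with a verbalization rate below the $30\%$ threshold) can be sketched as a simple decision rule. The code below is an illustrative assumption rather than the paper's definition: the class and function names, the family-wise level `alpha = 0.05`, and the number of tested concepts (`n_concepts = 20`) are hypothetical, and only the p-value, verbalization rate, and $30\%$ threshold are taken from Figure 1.

```python
from dataclasses import dataclass


@dataclass
class ConceptResult:
    name: str
    p_value: float             # p-value of the paired test for this concept
    verbalization_rate: float  # share of flipped decisions whose CoT cites the concept


def is_unverbalized_bias(
    result: ConceptResult,
    n_concepts: int,
    alpha: float = 0.05,                    # assumed family-wise level
    verbalization_threshold: float = 0.30,  # threshold quoted in Figure 1
) -> bool:
    """Flag a concept that is significant after Bonferroni correction
    and is cited in fewer than the threshold share of flipped decisions."""
    significant = result.p_value < alpha / n_concepts
    rarely_cited = result.verbalization_rate < verbalization_threshold
    return significant and rarely_cited


# p-value and verbalization rate from Figure 1; n_concepts is hypothetical.
religion = ConceptResult("religious affiliation", p_value=9.15e-7, verbalization_rate=0.124)
flagged = is_unverbalized_bias(religion, n_concepts=20)

# A concept with the same p-value that the model cites openly is not flagged.
overt = ConceptResult("hypothetical overt concept", p_value=9.15e-7, verbalization_rate=0.80)
overt_flagged = is_unverbalized_bias(overt, n_concepts=20)
```

Under these assumptions the Figure 1 concept is flagged, while a concept with the same p-value that the model cites openly in most flipped decisions is filtered out.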