Evaluating Implicit Biases in LLM Reasoning through Logic Grid Puzzles

Fatima Jahara, Mark Dredze, Sharon Levy

TL;DR

This work introduces PRIME, a puzzle-based framework that diagnoses implicit biases in LLM reasoning by having models solve logic grid puzzles. It constructs triplets of puzzles (Generic, Stereotypical, Anti-stereotypical) across sizes from $2 \times 3$ to $4 \times 4$, enabling automatic puzzle generation, ground-truth verification, and controlled comparisons. The framework defines two core metrics, Edit Distance and Bias Difference, to quantify deductive performance and the influence of bias across bias-probing and general dimensions, and evaluates a diverse set of open and closed models under explicit and implicit bias settings. Key findings show that models reason more accurately on stereotype-aligned puzzles and that CoT prompting mitigates bias more reliably than static debiasing, highlighting gaps in current alignment and safety strategies. PRIME thus offers a scalable, formal procedure to diagnose, quantify, and potentially mitigate implicit biases embedded in LLM reasoning, with implications for fairness in decision-making tasks.

Abstract

While recent safety guardrails effectively suppress overtly biased outputs, subtler forms of social bias emerge during complex logical reasoning tasks that evade current evaluation benchmarks. To fill this gap, we introduce a new evaluation framework, PRIME (Puzzle Reasoning for Implicit Biases in Model Evaluation), that uses logic grid puzzles to systematically probe the influence of social stereotypes on logical reasoning and decision making in LLMs. Our use of logic puzzles enables automatic generation and verification, as well as variability in complexity and biased settings. PRIME includes stereotypical, anti-stereotypical, and neutral puzzle variants generated from a shared puzzle structure, allowing for controlled and fine-grained comparisons. We evaluate multiple model families across puzzle sizes and test the effectiveness of prompt-based mitigation strategies. Focusing our experiments on gender stereotypes, our findings highlight that models consistently reason more accurately when solutions align with stereotypical associations. This demonstrates the significance of PRIME for diagnosing and quantifying social biases perpetuated in the deductive reasoning of LLMs, where fairness is critical.
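
The triplet design described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `PuzzleTriplet`, `cell_errors`, and `error_gap`, the grid layout, and the toy names and occupations are hypothetical. It counts mismatched cells against the ground truth and compares the counts between the anti-stereotypical and stereotypical variants; this is only one plausible reading of how such a comparison could work, not the paper's definition of Edit Distance or Bias Difference, which this section does not give.

```python
from dataclasses import dataclass

# Hypothetical layout: one row per entity, one column per attribute category.
Grid = list[list[str]]


@dataclass
class PuzzleTriplet:
    """Ground-truth solutions for three variants sharing one puzzle structure."""
    generic: Grid
    stereotypical: Grid
    anti_stereotypical: Grid


def cell_errors(predicted: Grid, truth: Grid, columns=None) -> int:
    """Count cells where the prediction differs from the ground truth.

    If `columns` is given, only those column indices are compared,
    e.g. a single bias-probing column.
    """
    cols = range(len(truth[0])) if columns is None else columns
    return sum(
        1
        for p_row, t_row in zip(predicted, truth)
        for c in cols
        if p_row[c] != t_row[c]
    )


def error_gap(err_anti: int, err_stereo: int) -> int:
    """Positive when the model makes more errors on the anti-stereotypical variant."""
    return err_anti - err_stereo


# Toy example; column 2 plays the role of the bias-probing category.
triplet = PuzzleTriplet(
    generic=[["1", "A", "red"], ["2", "B", "blue"]],
    stereotypical=[["1", "John", "engineer"], ["2", "Mary", "nurse"]],
    anti_stereotypical=[["1", "John", "nurse"], ["2", "Mary", "engineer"]],
)

# Hypothetical model outputs: correct on the stereotypical puzzle, but
# falling back on the stereotypical assignment for the anti-stereotypical one.
pred_s = [["1", "John", "engineer"], ["2", "Mary", "nurse"]]
pred_as = [["1", "John", "engineer"], ["2", "Mary", "nurse"]]

err_s = cell_errors(pred_s, triplet.stereotypical, columns=[2])
err_as = cell_errors(pred_as, triplet.anti_stereotypical, columns=[2])
gap = error_gap(err_as, err_s)
```

Because the three variants share one structure, any gap between `err_as` and `err_s` in this sketch can be attributed to the swapped attribute values rather than to differences in puzzle difficulty.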

Paper Structure

This paper contains 47 sections, 7 equations, 15 figures, and 7 tables.

Figures (15)

  • Figure 1: Example of an explicit bias evaluation using the question answering task (left) and a corresponding implicit bias evaluation using Stereotypical (top) and Anti-stereotypical (bottom) logic puzzles from PRIME (right). The LLM here relies on social stereotypes in solving the Bias-probing ("Occupation") category.
  • Figure 2: Bias–Correctness plot of errors in the bias-probing column $c_2$ for $S$ and $AS$ puzzles, averaged over 4$\times$3 and 4$\times$4 puzzle sizes with ${\mathcal{ED}_{\mathcal{BP}}} > 0$. The x-axis shows the normalized correctness score, while the y-axis shows the normalized bias score. Positive clustering along the y-axis indicates stereotypical errors, and negative clustering indicates anti-stereotypical errors.
  • Figure 3: Clue Types
  • Figure 4: Example of a Generic ($G$), Stereotypical ($S$), and Anti-stereotypical ($AS$) puzzle setup in PRIME.
  • Figure 5: Prompt template used for clue generation.
  • ...and 10 more figures