Evaluating Implicit Biases in LLM Reasoning through Logic Grid Puzzles
Fatima Jahara, Mark Dredze, Sharon Levy
TL;DR
This work introduces PRIME, a puzzle-based framework that diagnoses implicit biases in LLM reasoning by having models solve logic-grid puzzles. It constructs triplets of puzzles (Generic, Stereotypical, Anti-stereotypical) across sizes from $2 \times 3$ to $4 \times 4$, which enables automatic puzzle generation, ground-truth verification, and controlled comparisons. The framework defines two core metrics, Edit Distance and Bias Difference, to quantify deductive performance and the influence of bias across bias-probing and general dimensions, and it evaluates a diverse set of open and closed models under explicit and implicit bias settings. Key findings show that models reason more accurately on stereotype-aligned puzzles and that CoT prompting mitigates bias more reliably than static debiasing, highlighting gaps in current alignment and safety strategies. PRIME thus offers a scalable, formal procedure to diagnose, quantify, and potentially mitigate implicit biases embedded in LLM reasoning, with implications for fairness in decision-making tasks.
Abstract
While recent safety guardrails effectively suppress overtly biased outputs, subtler forms of social bias that evade current evaluation benchmarks emerge during complex logical reasoning tasks. To fill this gap, we introduce a new evaluation framework, PRIME (Puzzle Reasoning for Implicit Biases in Model Evaluation), that uses logic grid puzzles to systematically probe the influence of social stereotypes on logical reasoning and decision making in LLMs. Our use of logic puzzles enables automatic generation and verification, as well as variation in puzzle complexity and bias settings. PRIME includes stereotypical, anti-stereotypical, and neutral puzzle variants generated from a shared puzzle structure, allowing for controlled and fine-grained comparisons. We evaluate multiple model families across puzzle sizes and test the effectiveness of prompt-based mitigation strategies. Focusing our experiments on gender stereotypes, we find that models consistently reason more accurately when solutions align with stereotypical associations. This demonstrates the significance of PRIME for diagnosing and quantifying social biases perpetuated in the deductive reasoning of LLMs, particularly in settings where fairness is critical.
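To make the idea of automatic verification concrete, the following is a minimal sketch of how a small logic grid puzzle could be checked for a unique solution by brute-force enumeration. It is not PRIME's actual generator or verifier, which this section does not describe. The toy puzzle, the reading of "$2 \times 3$" as two entities with three attribute categories, and the names `solve`, `same_entity`, `CATEGORIES`, and `CLUES` are all assumptions made for illustration.

```python
from itertools import permutations, product


def same_entity(grid, cat_a, val_a, cat_b, val_b):
    """True if val_a (in cat_a) and val_b (in cat_b) belong to the same entity."""
    return grid[cat_a].index(val_a) == grid[cat_b].index(val_b)


def solve(categories, clues):
    """Return every assignment that satisfies all clues.

    categories: dict mapping a category name to its list of values
                (all lists have the same length).
    clues:      list of predicates, each taking a grid and returning a bool.
    The first category is held in a fixed order; every other category is
    tried in every possible order against it.
    """
    names = list(categories)
    first, rest = names[0], names[1:]
    solutions = []
    for perms in product(*(permutations(categories[c]) for c in rest)):
        grid = {first: tuple(categories[first])}
        grid.update(zip(rest, perms))
        if all(clue(grid) for clue in clues):
            solutions.append(grid)
    return solutions


# Hypothetical toy puzzle: 2 entities, 3 categories (an assumed reading of "2 x 3").
CATEGORIES = {
    "person": ["Alex", "Sam"],
    "job": ["baker", "pilot"],
    "pet": ["cat", "dog"],
}

CLUES = [
    # Clue 1: Alex is not the baker.
    lambda g: not same_entity(g, "person", "Alex", "job", "baker"),
    # Clue 2: The baker owns the cat.
    lambda g: same_entity(g, "job", "baker", "pet", "cat"),
]

solutions = solve(CATEGORIES, CLUES)
# A well-formed puzzle should admit exactly one solution.
is_well_formed = len(solutions) == 1
```

Under these assumptions, a stereotypical or anti-stereotypical variant would reuse the same clue structure and change only the attribute values, so the ground truth for all three variants could be checked by the same enumeration.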
