CBEval: A framework for evaluating and interpreting cognitive biases in LLMs

Ammar Shaikh, Raj Abhijit Dandekar, Sreedath Panat, Rajat Dandekar

TL;DR

This work addresses cognitive biases that emerge in frontier LLMs and proposes CBEval, a framework to interpret, quantify, and visualize these biases using Shapley-value attribution. By modeling individual prompt words as players in a cooperative game with value function $v$ and computing each word's Shapley value $\phi_i(v)$, the approach yields influence graphs that highlight which tokens drive biased outputs. The study examines five biases (framing effect, round-number bias, anchoring effect, representativeness heuristic, and priming effect) across models such as GPT-4o and GPT-4o-mini, revealing a cognitive bias barrier in framing and model-specific robustness patterns. Overall, CBEval contributes to AI safety and interpretability by offering a cost-efficient, reproducible method to audit and understand bias mechanisms in LLM reasoning, linking observed behaviors to prompt structure and training-data priors.
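To make the attribution idea concrete, the following is a minimal sketch of exact Shapley values over prompt words, using the standard definition $\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,[v(S \cup \{i\}) - v(S)]$. It is not the paper's implementation. The example words and the `toy_value` function are hypothetical stand-ins: in a real audit, $v$ would presumably be some model-derived score (for instance, the probability of a particular answer token when only the kept words appear in the prompt), which is an assumption here rather than something this summary specifies.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List


def shapley_values(
    n_players: int, value: Callable[[FrozenSet[int]], float]
) -> List[float]:
    """Exact Shapley values phi_i(v) for players 0..n_players-1.

    Enumerates every coalition, so it is only practical for short prompts.
    """
    players = range(n_players)
    phi = [0.0] * n_players
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Weight |S|! (n - |S| - 1)! / n! shared by all coalitions of this size.
            weight = (
                factorial(size)
                * factorial(n_players - size - 1)
                / factorial(n_players)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of word i when added to coalition S.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical prompt words (illustration only, not taken from the paper).
words = ["program", "saves", "200", "lives"]


def toy_value(kept: FrozenSet[int]) -> float:
    """Hypothetical stand-in for a model score given the kept word indices."""
    score = 0.0
    if 1 in kept:  # "saves" on its own shifts the score
        score += 0.25
    if 1 in kept and 3 in kept:  # "saves" together with "lives" shifts it further
        score += 0.5
    return score


phi = shapley_values(len(words), toy_value)
for word, contribution in zip(words, phi):
    print(f"{word}: {contribution:.3f}")
```

Exact enumeration evaluates the value function on all $2^n$ word subsets, so for longer prompts a sampling-based approximation would be needed. The per-word scores sum to $v(\text{all words}) - v(\varnothing)$, which is the property that lets them be read as a decomposition of the observed effect.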

Abstract

Rapid advancements in Large Language Models (LLMs) have significantly enhanced their reasoning capabilities. Despite improved performance on benchmarks, LLMs exhibit notable gaps in their cognitive processes. Additionally, as reflections of human-generated data, these models have the potential to inherit cognitive biases, raising concerns about their reasoning and decision-making capabilities. In this paper we present a framework to interpret, understand, and provide insights into a host of cognitive biases in LLMs. Conducting our research on frontier language models, we are able to elucidate reasoning limitations and biases, and to explain these biases by constructing influence graphs that identify the phrases and words most responsible for the biases manifested in LLMs. We further investigate biases such as round-number bias and the cognitive bias barrier revealed when examining the framing effect in language models.

Paper Structure

This paper contains 10 sections, 1 equation, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Comparison of preference stock for positively and negatively framed prompts
  • Figure 2: Shapley score attribution for positively framed prompt
  • Figure 3: Shapley score attribution for negatively framed prompt
  • Figure 4: Comparison of probability scores for tokens "A" and "B" at varying loss percentages
  • Figure 5: Probability scores for prompt B on GPT-4o-mini
  • ...and 5 more figures