AI Can Be Cognitively Biased: An Exploratory Study on Threshold Priming in LLM-Based Batch Relevance Assessment

Nuo Chen, Jiqun Liu, Xiaoyu Dong, Qijiong Liu, Tetsuya Sakai, Xiao-Ming Wu

TL;DR

This study investigates whether LLMs exhibit threshold priming bias when performing batch relevance judgments in information retrieval. Using ten TREC 2019 Deep Learning (TRDL19) topics and 20 trials per topic, the authors test GPT-3.5, GPT-4, LLaMa2-13B, and LLaMa2-70B across prologue-epilogue batch designs with varying prologue lengths (PL) and epilogue lengths (EL). The results show robust threshold priming across most configurations, with model- and topic-dependent variations; GPT models generally show stronger effects than LLaMa variants, and certain topics (e.g., 451602) exhibit anchoring-like behavior or minimal priming. The findings imply that cognitive biases can influence automated relevance labeling, motivating bias-aware auditing, robust prompting strategies, and human-in-the-loop approaches to mitigate such biases in IR evaluation and beyond.

Abstract

Cognitive biases are systematic deviations in thinking that lead to irrational judgments and problematic decision-making, extensively studied across various fields. Recently, large language models (LLMs) have shown advanced understanding capabilities but may inherit human biases from their training data. While social biases in LLMs have been well-studied, cognitive biases have received less attention, with existing research focusing on specific scenarios. The broader impact of cognitive biases on LLMs in various decision-making contexts remains underexplored. We investigated whether LLMs are influenced by the threshold priming effect in relevance judgments, a core task and widely discussed research topic in the Information Retrieval (IR) community. The priming effect occurs when exposure to certain stimuli unconsciously affects subsequent behavior and decisions. Our experiment employed 10 topics from the TREC 2019 Deep Learning passage track collection, and tested AI judgments under different document relevance scores, batch lengths, and LLMs, including GPT-3.5, GPT-4, LLaMa2-13B and LLaMa2-70B. Results showed that LLMs tend to give lower scores to later documents if earlier ones have high relevance, and vice versa, regardless of the combination and model used. Our finding demonstrates that LLMs' judgments, similar to human judgments, are also influenced by threshold priming biases, and suggests that researchers and system engineers should take into account potential human-like cognitive biases in designing, evaluating, and auditing LLMs in IR tasks and beyond.

Paper Structure (9 sections, 5 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: An example of the priming effect. In the left-hand context, where the stimuli are all related to food, individuals are more likely to associate the stimulus with "soup." Conversely, in the right-hand context, where the stimuli pertain to bathing, individuals are more likely to associate the stimulus with "soap."
  • Figure 2: An example of the methodology adopted in our experiment. In both the left batch and the right batch, the documents comprising the epilogue are identical. However, influenced by the relevance threshold of the documents in the prologue, the LLM assigned different relevance scores to the same documents.
  • Figure 3: An example of the prompt used in our experiment. This is a low-threshold batch. The gray text is the system prompt, the blue text is the prologue composed of documents with a ground-truth relevance of 0, and the dark orange text is the epilogue.
  • Figure 4: The average predicted scores for the documents in the epilogue. From left to right, the results correspond to GPT-3.5, GPT-4, LLaMa-13B, and LLaMa-70B. From top to bottom, the results are for a prologue length (PL) of 4 with an epilogue length (EL) of 4, a PL of 4 with an EL of 8, and a PL of 8 with an EL of 8. Note that the ground-truth relevance of all these epilogue documents is 2. The p-value is obtained from a dependent t-test.
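
The design described in the Figure 2 and Figure 4 captions (the same epilogue judged after two different prologues, then compared with a dependent t-test) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the document IDs, and the example scores are all hypothetical, and the scores are made up purely to exercise the arithmetic. Only the t statistic is computed here; a p-value would need a t-distribution, for example via `scipy.stats.ttest_rel`.

```python
import math
import random
from statistics import mean, stdev


def make_paired_batches(high_docs, low_docs, epilogue_docs, pl, seed=0):
    """Build a high-threshold and a low-threshold batch that share one epilogue.

    Each batch is a prologue of `pl` documents followed by the same epilogue.
    The sampling scheme is an assumption made for illustration.
    """
    rng = random.Random(seed)
    high_prologue = rng.sample(high_docs, pl)
    low_prologue = rng.sample(low_docs, pl)
    epilogue = list(epilogue_docs)
    return high_prologue + epilogue, low_prologue + epilogue


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a dependent (paired) t-test."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two matched samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical document IDs: PL = 4 prologue documents, EL = 4 epilogue documents.
high_pool = [f"high_{i}" for i in range(10)]
low_pool = [f"low_{i}" for i in range(10)]
epilogue = [f"epi_{i}" for i in range(4)]
high_batch, low_batch = make_paired_batches(high_pool, low_pool, epilogue, pl=4)

# Made-up epilogue scores for the two conditions (NOT results from the paper).
scores_after_low_prologue = [2, 3, 2, 3]
scores_after_high_prologue = [1, 2, 2, 1]
t_stat, dof = paired_t_statistic(scores_after_low_prologue, scores_after_high_prologue)
```

A paired test fits this setup because each epilogue document is scored under both conditions, so the comparison is between matched scores rather than between two independent groups.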