Balancing Rigor and Utility: Mitigating Cognitive Biases in Large Language Models for Multiple-Choice Questions

Hanyang Zhong; Liman Wang; Wenting Cao; Zeyuan Sun

TL;DR

This work investigates cognitive biases in how large language models (LLMs) answer multiple-choice questions (MCQs), arguing that rational deviations can enhance efficiency when properly moderated. It introduces an abstention option and heuristic moderation, evaluated on the BRU dataset with GPT-4, Gemini 1.0 Pro, and LLaMA3-70B, together with a Bias Detection Loop that moves from general to specific bias inspection. Specific Bias Inspection (SBI) combined with abstention yields the highest accuracy and the lowest error rate, while scaling the scope of bias inspection and detecting biases dynamically improve reliability and alignment with human reasoning. The results suggest a practical framework for deploying LLMs in decision-support tasks where uncertainty and bias risk are nontrivial.

Abstract

This paper examines the role of cognitive biases in the decision-making processes of large language models (LLMs), challenging the conventional goal of eliminating all biases. We show that certain cognitive biases, when properly balanced, can enhance decision-making efficiency through rational deviations and heuristic shortcuts. By introducing heuristic moderation and an abstention option, which allows LLMs to withhold responses when uncertain, we reduce error rates, improve decision accuracy, and optimize decision rates. Using the Balance Rigor and Utility (BRU) dataset, developed through expert collaboration, we demonstrate that targeted inspection of cognitive biases aligns LLM decisions more closely with human reasoning, which enhances reliability and suggests strategies for future improvements. This approach offers a novel way to leverage cognitive biases to improve the practical utility of LLMs across various applications.
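To make the trade-off between error rate, accuracy, and decision rate concrete, the following is a minimal Python sketch of how the three quantities could be computed once abstention is allowed. The definitions used here are assumptions for illustration and may differ from the paper's own equations: the decision rate is taken as the fraction of questions answered, valid-vote accuracy as the fraction of answered questions that are correct, and the error rate as the fraction of all questions answered incorrectly. The names `Rates`, `score`, and the use of `None` to mark an abstention are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rates:
    decision_rate: float   # assumed: answered / total
    valid_accuracy: float  # assumed: correct / answered
    error_rate: float      # assumed: wrong / total


def score(predictions: Sequence[Optional[str]], answers: Sequence[str]) -> Rates:
    """Score multiple-choice predictions where None marks an abstention.

    These metric definitions are illustrative assumptions, not the
    paper's formulas.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    total = len(answers)
    if total == 0:
        raise ValueError("at least one question is required")

    # Keep only the questions the model chose to answer.
    answered = [(p, a) for p, a in zip(predictions, answers) if p is not None]
    correct = sum(1 for p, a in answered if p == a)
    wrong = len(answered) - correct

    return Rates(
        decision_rate=len(answered) / total,
        # With no answered questions, accuracy is reported as 0.0 by convention.
        valid_accuracy=correct / len(answered) if answered else 0.0,
        error_rate=wrong / total,
    )


if __name__ == "__main__":
    # Four questions: one abstention, two correct answers, one wrong answer.
    print(score(["A", None, "B", "C"], ["A", "B", "B", "D"]))
```

Under these assumed definitions, an abstention lowers the decision rate but cannot raise the error rate, which is the sense in which withholding an uncertain answer trades coverage for reliability.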

Paper Structure (24 sections, 3 equations, 5 figures, 21 tables, 1 algorithm)

Introduction
Related Works
Methodology
Phenomenon and Method Formation
Strategic Abstention
Scaling the Inspection Scope
Feedback Loop with Bias Detection
Dataset and Experimental Setup
Dataset Setup
Models and Prompting
Abstention Prompting
Non-Abstention Prompting
General Bias Inspection
Specific Bias Inspection
Bias Detection Module
...and 9 more sections

Figures (5)

Figure 1: Valid vote accuracy and error rates on the BRU dataset for LLMs balancing rational deviations, both with and without the option to abstain. 'Sta' represents the standard baseline used for comparison, while 'GBI' and 'SBI' denote the proposed prompting strategies, as detailed in Section 4.
Figure 2: QA examples from GPT-4. The Conjunction Fallacy is one type of cognitive bias. Scaling the scope of bias inspection can influence rational deviations and thereby change the outcome of an LLM's reasoning. To address this, we propose a feedback-loop Bias Detection module that identifies the type of bias and adjusts the inspection scope when the model considers abstaining from an answer. This approach helps LLMs provide more accurate responses by systematically addressing biases during decision-making. A detailed demonstration of the whole workflow is shown in Appendix Tables 18-21.
Figure 3: The combination of TT, TF, FT, FF, and O rates for GPT-4, Gemini 1.0 Pro, and LLaMA3-70B on the BRU dataset using different prompting strategies. 'NA-' denotes Non-Abstention, 'A-' denotes Abstention, and 'Sta' represents the standard baseline used for comparison. The detailed distributions of the TT, TF, FT, FF, and O rates for GPT-4, Gemini 1.0 Pro, and LLaMA3-70B are listed in Appendix Tables 5, 6, and 7.
Figure 4: Distribution chart of abstention rates for GPT-4, Gemini 1.0 Pro, and LLaMA3-70B across different question types in the BRU dataset with Abstention enabled and using different prompting strategies.
Figure 5: Dataset design and question classification. The numbers in parentheses indicate the number of questions in each category.
