Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias
Itay Itzhak, Gabriel Stanovsky, Nir Rosenfeld, Yonatan Belinkov
TL;DR
The paper investigates whether instruction tuning (IT) and reinforcement learning from human feedback (RLHF) induce cognitive biases in large language models. By adapting classic human experiments to LM prompts and comparing control and treatment datasets, the authors quantify the decoy effect, the certainty effect, and the belief bias with a bias score that measures the shift in target-option choices between the two conditions. Across the GPT-3, Mistral, and T5 families, IT and RLHF generally amplify these biases. GPT-4 shows a pronounced decoy effect yet partial mitigation in some belief-bias tasks, and the effect of model size is nuanced and task-dependent. The findings highlight a paradox: aligning models to human objectives through IT and RLHF can worsen certain biases, which underscores the need for bias-aware alignment and careful evaluation of decision-making behavior in deployed systems.
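As a rough illustration of the bias score described above, the sketch below computes the difference in target-option choice rates between a treatment and a control condition. It assumes the score is a simple difference of proportions; the function names, the option labels, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def target_rate(choices: Sequence[str], target: str) -> float:
    """Fraction of model answers that picked the target option."""
    if not choices:
        raise ValueError("choices must be non-empty")
    return sum(choice == target for choice in choices) / len(choices)


def bias_score(
    control_choices: Sequence[str],
    treatment_choices: Sequence[str],
    target: str,
) -> float:
    """Shift in target-option choices from control to treatment.

    Assumption: the score is the treatment rate minus the control rate,
    so a positive value means the treatment pushed the model toward
    the target option.
    """
    return target_rate(treatment_choices, target) - target_rate(
        control_choices, target
    )


# Toy data (hypothetical): one chosen option per prompt.
control = ["A", "B", "A", "B"]      # target chosen in 2 of 4 prompts
treatment = ["A", "A", "A", "B"]    # target chosen in 3 of 4 prompts

score = bias_score(control, treatment, target="A")
print(f"bias score: {score:+.2f}")
```

Under this assumed definition, a score near zero indicates that the treatment manipulation did not change how often the model picked the target option, while a larger positive score indicates a stronger bias.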
Abstract
Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically. While these tuning methods can help align models with human objectives and generate high-quality text, not much is known about their potential adverse effects. In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs, focusing on three cognitive biases: the decoy effect, the certainty effect, and the belief bias, all of which are known to influence human decision making and reasoning. Our findings highlight the presence of these biases in various models from the GPT-3, Mistral, and T5 families. Notably, we find a stronger presence of biases in models that have undergone instruction tuning, such as Flan-T5, Mistral-Instruct, GPT-3.5, and GPT-4. Our work constitutes a step toward comprehending cognitive biases in instruction-tuned LMs, which is crucial for the development of more reliable and unbiased language models.
