Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias

Itay Itzhak, Gabriel Stanovsky, Nir Rosenfeld, Yonatan Belinkov

TL;DR

The paper investigates whether instruction tuning (IT) and reinforcement learning from human feedback (RLHF) induce cognitive biases in large language models. By adapting classic human experiments to LM prompts and comparing control versus treatment datasets, the authors quantify decoy, certainty, and belief biases via a bias score that measures shifts in target-option choices. Across the GPT-3, Mistral, and T5 families, IT and RLHF generally amplify biases. GPT-4 shows a pronounced decoy bias yet partial mitigation in some belief-bias tasks, and larger models exhibit nuanced, task-dependent effects. The findings highlight a paradox: aligning models to human objectives through IT/RLHF can worsen certain biases, underscoring the need for bias-aware alignment and careful evaluation of decision-making behavior in deployed systems.
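
As a rough illustration of the bias score mentioned above, the sketch below assumes the score is the proportion of target-option choices in the treatment condition minus the same proportion in the control condition. That definition, the function names, and the toy responses are assumptions made for illustration; they are not taken from the paper's code or data.

```python
from typing import Sequence


def target_rate(choices: Sequence[str], target: str) -> float:
    """Fraction of responses that picked the target option."""
    if not choices:
        raise ValueError("choices must be non-empty")
    return sum(c == target for c in choices) / len(choices)


def bias_score(treatment: Sequence[str], control: Sequence[str], target: str) -> float:
    """Assumed definition: shift in target-option choice rate,
    treatment minus control. Positive values mean the treatment
    prompt pushed the model toward the target option."""
    return target_rate(treatment, target) - target_rate(control, target)


# Hypothetical model answers to a decoy-style prompt (not real results).
control = ["A", "B", "B", "A"]    # no decoy option shown
treatment = ["A", "A", "A", "B"]  # decoy option shown

score = bias_score(treatment, control, target="A")
print(f"bias score: {score:+.2f}")  # prints "bias score: +0.25"
```

Under this assumed definition, a score near zero means the treatment prompt did not change how often the target option was chosen, and a negative score means it was chosen less often.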

Abstract

Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically. While these tuning methods can help align models with human objectives and generate high-quality text, not much is known about their potential adverse effects. In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs, focusing on three cognitive biases - the decoy effect, the certainty effect, and the belief bias - all of which are known to influence human decision-making and reasoning. Our findings highlight the presence of these biases in various models from the GPT-3, Mistral, and T5 families. Notably, we find a stronger presence of biases in models that have undergone instruction tuning, such as Flan-T5, Mistral-Instruct, GPT3.5, and GPT4. Our work constitutes a step toward comprehending cognitive biases in instruction-tuned LMs, which is crucial for the development of more reliable and unbiased language models.

Paper Structure

This paper contains 41 sections, 1 equation, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The impact of model size on bias scores. The larger Flan-T5-XXL exhibits higher bias scores in decoy cheaper, certainty, and belief valid biases while demonstrating lower bias scores in decoy expensive and belief invalid biases compared to the smaller Flan-T5-XL. The decoy expensive bias discrepancy may stem from Flan-T5-XXL's preference for higher-priced products, while the belief invalid bias reduction can be attributed to the model's enhanced accuracy with neutral arguments.
  • Figure 2: Acceptance rates of the Flan-T5 models on believable (green) and unbelievable (red) arguments in the treatment condition and on neutral arguments (blue) in the control condition, split into valid and invalid arguments. The Belief Invalid bias score of the larger Flan-T5-XXL (lower panel) appears lower than that of the smaller Flan-T5-XL (upper panel) because the model is less successful on the neutral arguments (blue).
  • Figure 3: The impact of format few-shots on bias scores using DaVinci-003 (top) and Mistral-Instruct (bottom). Few-shot examples yield slightly lower bias scores in most models; in Mistral-Instruct, the Belief bias scores are significantly lower while the certainty bias score increases. To reduce computation costs, the Decoy Expensive and Decoy Cheaper bias scores are calculated on a single product category (real-estate properties).
  • Figure 4: The impact of format few-shots compared with task few-shots on bias scores, using the DaVinci-003 model. When the model is prompted with examples from the same task, the decrease in bias scores is smaller than when it is prompted with examples that merely share the task's format.
  • Figure 5: The bias scores of the decoy cheaper effect across various products for the Flan-T5-XXL and DaVinci-003 models. The bias is present for every product, indicating that the behavior is roughly uniform within each model across product categories and price ranges, in line with human cognitive theory.
  • ...and 2 more figures