Large Language Models are Biased Reinforcement Learners

William M. Hayes, Nicolas Yax, Stefano Palminteri

TL;DR

This paper examines whether large language models (LLMs) exhibit relative value biases when used as in-context reinforcement learners in bandit tasks. Using five bandit tasks, four transformer-based models, and two prompt designs, the authors show that LLMs can learn from in-context feedback but display a relative value bias, especially when the prompt includes explicit outcome comparisons. Computational modeling indicates that a simple RL framework with both absolute and relative outcome encodings best describes behavior, with the relative component amplified by the comparisons prompt. Preliminary hidden-state analyses indicate that relative-value information is detectable in the final-layer activations of a raw, pretrained model, suggesting that the bias is not limited to fine-tuned LLMs. The findings have practical implications for deploying LLMs in decision-making, and they motivate both prompting strategies that mitigate such biases and analyses of more models and tasks.

Abstract

In-context learning enables large language models (LLMs) to perform a variety of tasks, including learning to make reward-maximizing choices in simple bandit tasks. Given their potential use as (autonomous) decision-making agents, it is important to understand how these models perform such reinforcement learning (RL) tasks and the extent to which they are susceptible to biases. Motivated by the fact that, in humans, it has been widely documented that the value of an outcome depends on how it compares to other local outcomes, the present study focuses on whether similar value encoding biases apply to how LLMs encode rewarding outcomes. Results from experiments with multiple bandit tasks and models show that LLMs exhibit behavioral signatures of a relative value bias. Adding explicit outcome comparisons to the prompt produces opposing effects on performance, enhancing maximization in trained choice sets but impairing generalization to new choice sets. Computational cognitive modeling reveals that LLM behavior is well-described by a simple RL algorithm that incorporates relative values at the outcome encoding stage. Lastly, we present preliminary evidence that the observed biases are not limited to fine-tuned LLMs, and that relative value processing is detectable in the final hidden layer activations of a raw, pretrained model. These findings have important implications for the use of LLMs in decision-making applications.
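
The idea of incorporating relative values at the outcome encoding stage can be illustrated with a minimal sketch. This is not the authors' fitted model: the binary comparison used as the relative term, the mixing parameter `weight`, the complete-feedback update (both options updated each round), the noise-free rewards, and the option names and values are all assumptions chosen for illustration.

```python
def encode_outcome(outcome, other, weight):
    """Blend absolute and relative encodings of one outcome.

    weight = 0.0 -> purely absolute encoding (the raw outcome).
    weight = 1.0 -> purely relative encoding (assumed here to be a binary
    comparison with the other outcome delivered in the same round).
    """
    if outcome > other:
        relative = 1.0
    elif outcome < other:
        relative = 0.0
    else:
        relative = 0.5
    return weight * relative + (1.0 - weight) * outcome


def train(contexts, weight, learning_rate=0.3, rounds=50):
    """Delta-rule learning over fixed choice contexts (hypothetical setup)."""
    values = {}
    for _ in range(rounds):
        for (name_a, reward_a), (name_b, reward_b) in contexts:
            pairs = ((name_a, reward_a, reward_b), (name_b, reward_b, reward_a))
            for name, own, other in pairs:
                target = encode_outcome(own, other, weight)
                old = values.get(name, 0.0)
                # Move the stored value toward the encoded outcome.
                values[name] = old + learning_rate * (target - old)
    return values


# Two hypothetical training contexts: a low-value pair and a high-value pair.
CONTEXTS = [
    (("low_worse", 1.0), ("low_better", 2.0)),
    (("high_worse", 4.0), ("high_better", 5.0)),
]

absolute = train(CONTEXTS, weight=0.0)
relative = train(CONTEXTS, weight=1.0)

# Transfer pair never seen together in training: low_better vs. high_worse.
prefers_max_absolute = absolute["high_worse"] > absolute["low_better"]
prefers_max_relative = relative["high_worse"] > relative["low_better"]
```

Under these assumptions, both learners rank options correctly within each trained context, but only the absolute learner picks the reward-maximizing option in the new pairing. That matches the qualitative pattern the abstract reports: relative encoding can help in trained choice sets while impairing generalization to new ones.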

Paper Structure (21 sections, 3 equations, 19 figures, 6 tables)

Figures (19)

  • Figure 1: (a) A bandit task with eight options grouped into four contexts. Each context has a lower value option and a higher value option. During the initial training phase, the options produce Gaussian-distributed rewards (means in parentheses, standard deviation of $1$). (b) All pairwise combinations of options in the transfer test. Blue lines show the originally trained pairs. Red lines show pairs for which absolute and relative values conflict. (c) Prompt designs used in all experiments. The comparisons prompt added explicit comparisons between the local outcomes delivered in each round.
  • Figure 2: (a-b) Mean choice accuracy (proportion of reward-maximizing choices) in the training phase and transfer test. Each colored point represents the mean accuracy for a specific combination of task, model, and prompt design across 30 runs (lines connect the same task/model combination). Means and standard errors are also shown. (c-d) Pairwise contrasts for the effect of prompt design, broken down by task and model. *p < .05 **p < .01 ***p < .001 (Bonferroni-adjusted for 20 tests).
  • Figure 3: Best-fitting models (%) across tasks (a), prompt designs (b), and LLMs (c).
  • Figure 4: (a-c) Estimated relative encoding parameters across tasks (a), prompt designs (b), and LLMs (c). (d) Estimated learning rates. In each panel, black points show the means and standard errors.
  • Figure 5: A list of the options in each bandit task. The notation $(x, p; y)$ means that $x$ occurred with probability $p$, otherwise $y$. The notation $N(a, b)$ refers to a normal distribution with mean $a$ and standard deviation $b$.
  • ...and 14 more figures
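
The reward notation defined in the Figure 5 caption can be read as a sampling rule. The sketch below shows one way to draw rewards from $(x, p; y)$ and $N(a, b)$. The function names and the example values are hypothetical and are not taken from the paper's task list.

```python
import random


def sample_lottery(x, p, y, rng):
    """Draw from (x, p; y): x with probability p, otherwise y."""
    return x if rng.random() < p else y


def sample_normal(a, b, rng):
    """Draw from N(a, b), where b is the standard deviation (not the variance)."""
    return rng.gauss(a, b)


def lottery_mean(x, p, y):
    """Expected value of the lottery (x, p; y)."""
    return p * x + (1 - p) * y


rng = random.Random(0)  # seeded generator so repeated runs give the same draws

# Hypothetical options, used only to show the notation in action.
risky_draw = sample_lottery(10, 0.75, 0, rng)   # either 10 or 0
gaussian_draw = sample_normal(3.0, 1.0, rng)    # a float centered on 3.0
risky_expected = lottery_mean(10, 0.75, 0)      # 0.75 * 10 + 0.25 * 0
```

Passing an explicit `random.Random` instance rather than using the module-level functions keeps each simulated run independently reproducible.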