LLM Agents Display Human Biases but Exhibit Distinct Learning Patterns
Idan Horowitz, Ori Plonsky
TL;DR
This paper examines whether large language model (LLM) agents exhibit human-like decisions from experience (DFE) phenomena when learning from feedback across many trials. It compares two state-of-the-art LLMs (GPT-4o mini and Gemini-1.5 Flash-002), run in chat and all-history contexts and at multiple temperatures, with human participants in four binary choice tasks of 100 trials each. The study finds that both LLMs and humans underweight rare events and show a correlation effect, but the underlying processes differ dramatically: humans show a robust wavy recency effect and surprise-driven switches, whereas LLMs display a strong but non-wavy recency bias and no surprise-triggered changes. The results show that aggregate similarities do not imply similar learning mechanisms, which limits the use of LLMs to predict or simulate human learning without more nuanced analyses and possibly fine-tuning.
Abstract
We investigate the choice patterns of Large Language Models (LLMs) in the context of Decisions from Experience tasks that involve repeated choice and learning from feedback, and compare their behavior to that of human participants. We find that, in the aggregate, LLMs appear to display behavioral biases similar to humans: both exhibit underweighting of rare events and correlation effects. However, more nuanced analyses of the choice patterns reveal that this happens for very different reasons. LLMs exhibit strong recency biases, unlike humans, who appear to respond in more sophisticated ways. While these different processes may lead to similar behavior on average, choice patterns contingent on recent events differ vastly between the two groups. Specifically, phenomena such as ``surprise triggers change'' and the ``wavy recency effect of rare events'' are robustly observed in humans, but entirely absent in LLMs. Our findings provide insights into the limitations of using LLMs to simulate and predict humans in learning environments and highlight the need for refined analyses of their behavior when investigating whether they replicate human decision-making tendencies.
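To make concrete how a strong recency bias alone can yield aggregate underweighting of rare events, the following is a minimal illustrative sketch, not the paper's tasks or analysis. Everything in it is an assumption chosen for illustration: the payoffs (a risky option paying +1 with probability 0.9 and -10 otherwise, against a safe payoff of 0), full feedback on the risky option after every trial, and a hypothetical agent that simply repeats whatever the most recent risky outcome favored.

```python
import random


def simulate_recency_agent(n_trials=100, p_rare=0.1, common=1.0,
                           rare=-10.0, safe=0.0, seed=0):
    """Fraction of trials on which a last-outcome agent picks the risky option.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    risky_choices = 0
    last_risky = None  # most recent payoff observed for the risky option
    for _ in range(n_trials):
        if last_risky is None:
            # No feedback yet: choose at random.
            choose_risky = rng.random() < 0.5
        else:
            # Pure recency: pick risky only if its last payoff beat the safe payoff.
            choose_risky = last_risky > safe
        risky_choices += int(choose_risky)
        # Assumed full feedback: the risky payoff is observed on every trial.
        last_risky = rare if rng.random() < p_rare else common
    return risky_choices / n_trials


def mean_risky_rate(n_agents=1000):
    """Average risky-choice rate over independent simulated agents."""
    rates = [simulate_recency_agent(seed=s) for s in range(n_agents)]
    return sum(rates) / n_agents


# Expected value of the risky option under the assumed payoffs (negative,
# so a payoff maximizer would prefer the safe option).
risky_ev = 0.9 * 1.0 + 0.1 * -10.0

if __name__ == "__main__":
    print(f"risky option expected value: {risky_ev:.2f}")
    print(f"mean risky-choice rate: {mean_risky_rate():.3f}")
```

Under these assumptions the agent picks the risky option on most trials even though its expected value is below the safe payoff, because the rare loss is usually absent from the single most recent observation. The aggregate choice rate therefore looks like underweighting of rare events, while the trial-by-trial pattern (always avoiding the risky option immediately after a loss) is what a sequential analysis would distinguish from human behavior.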
