LLM economicus? Mapping the Behavioral Biases of LLMs via Utility Theory

Jillian Ross; Yoon Kim; Andrew W. Lo

TL;DR

The paper investigates whether large language models exhibit human-like economic biases by mapping their decisions to utility functions derived from canonical behavioral experiments. Using a behavior-based pipeline, it fits Fehr-Schmidt inequity aversion, Kahneman-Tversky prospect theory, and hyperbolic time discounting to responses from multiple open- and closed-source LLMs across standardized games such as the Ultimatum, Gambling, and Waiting games. Key findings show LLMs differ from humans on several parameters (e.g., higher guilt, variable envy, mixed risk attitudes) and generally display stronger time discounting, with prompting interventions yielding inconsistent or limited effects. The work provides a framework and empirical roadmap for evaluating and shaping economic biases in LLMs, with implications for their use in finance and decision-support tasks.

Abstract

Humans are not homo economicus (i.e., rational economic beings). As humans, we exhibit systematic behavioral biases such as loss aversion, anchoring, framing, etc., which lead us to make suboptimal economic decisions. Insofar as such biases may be embedded in text data on which large language models (LLMs) are trained, to what extent are LLMs prone to the same behavioral biases? Understanding these biases in LLMs is crucial for deploying LLMs to support human decision-making. We propose utility theory, a paradigm at the core of modern economic theory, as an approach to evaluate the economic biases of LLMs. Utility theory enables the quantification and comparison of economic behavior against benchmarks such as perfect rationality or human behavior. To demonstrate our approach, we quantify and compare the economic behavior of a variety of open- and closed-source LLMs. We find that the economic behavior of current LLMs is neither entirely human-like nor entirely economicus-like. We also find that most current LLMs struggle to maintain consistent economic behavior across settings. Finally, we illustrate how our approach can measure the effect of interventions such as prompting on economic biases.
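
As an illustration of how a utility function supports comparison against a rationality benchmark, consider a responder in the Ultimatum game under the standard two-player Fehr-Schmidt form. This is a sketch under assumptions: the symbols $s$ (share offered to the responder, with the total normalized to 1), $\alpha$ (envy), and $\beta$ (guilt) are introduced here for illustration, and the paper's exact parameterization may differ.

$$
U_i(x_i, x_j) = x_i - \alpha \max(x_j - x_i, 0) - \beta \max(x_i - x_j, 0)
$$

For an offer $s < 1/2$, accepting gives the responder $x_i = s$ and the proposer $x_j = 1 - s$, while rejecting gives both players 0 and hence utility 0. The responder therefore accepts when

$$
s - \alpha (1 - 2s) \ge 0 \quad \Longleftrightarrow \quad s \ge \frac{\alpha}{1 + 2\alpha}.
$$

With $\alpha = 0$ (the homo economicus benchmark) every non-negative offer is accepted; as $\alpha$ grows, the acceptance threshold rises toward $1/2$. A fitted $\alpha$ thus places an agent on a scale between the rational benchmark and strongly envy-driven rejection.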

Paper Structure (19 sections, 7 equations, 10 figures, 4 tables)

Introduction
Method Overview: Generating Utility Functions
Quantifying the Economic Behavior of LLMs
Inequity Aversion: Ultimatum Game
Risk and Loss Aversion: Gambling Games
Time Discounting: Waiting Games
Intervening on the Economic Behavior of LLMs
Limitations and Discussion
Related Work
Conclusion
LLM Versions
Game Prompts and Examples
Human Comparison
Additional Results
Competence Tests
...and 4 more sections

Figures (10)

Figure 1: Fitted Fehr-Schmidt utility function parameters with $M = 56$ game settings. LLMs have higher guilt parameters but lower envy parameters than humans. LLaMa 2 13B rejects all offers in this setting, so its envy parameter is extremely large and not shown in the figure. LLMs are sampled $N=100$ times at temperature 1 for each setting. System prompt ablations are in Appendix \ref{ssec:ablations}.
Figure 2: Fitted probability weighting functions with $M=56$ game settings. Non-linear functions indicate probability distortion. LLMs are sampled $N=100$ times for each setting. System prompt ablations are in Appendix \ref{ssec:ablations}.
Figure 3: Fitted value function parameters with $M=56$ game settings. The utility function for risk and loss aversion is the value function with the parameters shown in the figure, multiplied by the probability weighting function shown in Figure \ref{fig:waiting_weighting}. LLMs are sampled $N=100$ times at temperature 1 for each setting. System prompt ablations are in Appendix \ref{ssec:ablations}.
Figure 4: Fitted hyperbolic time discounting model with $M=217$ game settings. LLMs are sampled $N=100$ times at temperature 1 for each setting. System prompt ablations are in Appendix \ref{ssec:ablations}.
Figure 5: Effects of prompting over $M=56$ game settings. GPT 4 is sampled $N=10$ times for each setting.
...and 5 more figures
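
The captions for Figures 2 to 4 refer to a value function multiplied by a probability weighting function, and to a hyperbolic discounting model. The sketch below shows common textbook forms of these functions in Python. It is an illustration only: the function names, the parameter names (`alpha`, `beta`, `lam`, `gamma`, `k`), the one-parameter weighting form, and the separable (non-cumulative) weighting of each outcome are assumptions made here, not necessarily the parameterization fitted in the paper.

```python
def value(x, alpha, beta, lam):
    """Kahneman-Tversky style value function (assumed power form).

    Gains are valued as x**alpha; losses as -lam * (-x)**beta,
    so lam > 1 expresses loss aversion.
    """
    if x >= 0:
        return x ** alpha
    return -lam * (-x) ** beta


def weight(p, gamma):
    """One-parameter probability weighting function (assumed form).

    gamma == 1 gives w(p) = p, i.e. no probability distortion.
    """
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)


def prospect_utility(outcomes, alpha, beta, lam, gamma):
    """Utility of a gamble given as a list of (payoff, probability) pairs.

    Each outcome's value is multiplied by its weighted probability.
    """
    return sum(weight(p, gamma) * value(x, alpha, beta, lam) for x, p in outcomes)


def hyperbolic_value(amount, delay, k):
    """Present value of `amount` received after `delay` periods.

    Larger k means stronger time discounting.
    """
    return amount / (1.0 + k * delay)


if __name__ == "__main__":
    gamble = [(100.0, 0.5), (-100.0, 0.5)]
    # Illustrative parameter values only (hypothetical, not fitted results).
    print(prospect_utility(gamble, alpha=1.0, beta=1.0, lam=1.0, gamma=1.0))
    print(prospect_utility(gamble, alpha=1.0, beta=1.0, lam=2.0, gamma=1.0))
    print(hyperbolic_value(100.0, delay=10.0, k=0.1))
```

With all parameters set to 1 the gamble is valued at its expected payoff, which is the risk-neutral benchmark; raising `lam` above 1 makes the same even-odds gamble unattractive, which is how loss aversion shows up in a fitted value function.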
