Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games

Nathan Herr, Fernando Acero, Roberta Raileanu, María Pérez-Ortiz, Zhibin Li

TL;DR

This work investigates the performance and merits of LLMs in two canonical game-theoretic two-player non-zero-sum games, Stag Hunt and Prisoner's Dilemma, and finds that the LLMs' performance drops when the game configuration is misaligned with the biases affecting them.

Abstract

Large Language Models (LLMs) are increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored. To benefit fully from the potential of LLMs, it is essential to understand their ability to function in complex social scenarios. Game theory, which is already used to understand real-world interactions, provides a good framework for assessing these abilities. This work investigates the performance and merits of LLMs in two canonical game-theoretic two-player non-zero-sum games, Stag Hunt and Prisoner's Dilemma. Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one of the following systematic biases: positional bias, payoff bias, or behavioural bias. This indicates that LLMs do not fully rely on logical reasoning when making these strategic decisions. As a result, the LLMs' performance drops when the game configuration is misaligned with the biases affecting them. When misaligned, GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B show an average performance drop of 32%, 25%, 34%, and 29% respectively in Stag Hunt, and 28%, 16%, 34%, and 24% respectively in Prisoner's Dilemma. Surprisingly, GPT-4o (a top-performing LLM across standard benchmarks) suffers the most substantial performance drop, suggesting that newer models are not addressing these issues. Interestingly, a commonly used method of improving the reasoning capabilities of LLMs, chain-of-thought (CoT) prompting, reduces the biases in GPT-3.5, GPT-4o, and Llama-3-8B but increases the effect of the bias in GPT-4-Turbo, indicating that CoT alone cannot serve as a robust solution to this problem. We perform several additional experiments, which provide further insight into these observed behaviours.
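
For readers unfamiliar with the two games named above, the following is a minimal sketch of how they differ strategically. The payoff numbers are illustrative assumptions chosen to satisfy each game's standard payoff ordering; they are not the values used in the paper, and `pure_nash_equilibria` is a hypothetical helper written for this example.

```python
from itertools import product

# Illustrative payoffs as (row player, column player).
# ASSUMPTION: these numbers are not from the paper; they only respect the
# usual ordering of payoffs for each game.
STAG_HUNT = {
    ("stag", "stag"): (5, 5),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}
PRISONERS_DILEMMA = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}


def pure_nash_equilibria(game):
    """Return action pairs where neither player gains by deviating alone."""
    actions = sorted({row_action for row_action, _ in game})
    equilibria = []
    for a, b in product(actions, repeat=2):
        row_payoff, col_payoff = game[(a, b)]
        # Row player cannot do better by switching while column plays b.
        row_ok = all(game[(alt, b)][0] <= row_payoff for alt in actions)
        # Column player cannot do better by switching while row plays a.
        col_ok = all(game[(a, alt)][1] <= col_payoff for alt in actions)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


if __name__ == "__main__":
    print("Stag Hunt:", pure_nash_equilibria(STAG_HUNT))
    print("Prisoner's Dilemma:", pure_nash_equilibria(PRISONERS_DILEMMA))
```

Under these assumed payoffs, Stag Hunt has two pure equilibria (both hunt stag, or both hunt hare), while Prisoner's Dilemma has a single one (mutual defection), which is why the "correct" action in such games depends on the behaviour a player is asked to prefer.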

Paper Structure (44 sections, 2 equations, 11 figures, 12 tables)

Figures (11)

  • Figure 1: Statistical analysis of the identified biases for all models tested: GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B. The larger the $-\log(p)$, the more statistically significant the bias. The dashed black line marks the threshold at which a bias becomes statistically significant (it lies close to the horizontal axis in both plots). Notably, each model is significantly affected by at least one of the identified biases under both prompting methods. The averages over all three biases for each prompting method are: (LEFT) Stag Hunt - AO: 89.4, 33.7, 61.3, 99.0 and CoT: 47.3, 49.3, 60.8, 24.6; (RIGHT) Prisoner's Dilemma - AO: 92.4, 9.11, 48.6, 99.7 and CoT: 35.0, 24.3, 27.5, 24.7. All models except GPT-4-Turbo are affected less by the biases when using CoT prompting.
  • Figure 2: Performance of each model (measured as selection of the correct action given the prompted preferred behaviours) under the two tested prompting methods: (1) Answer-Only (AO) and (2) Chain-of-Thought (CoT). In most experiments, CoT enables the models to achieve higher performance in both aligned and misaligned settings. The differences in accuracy between the misaligned and aligned settings are: (LEFT) Stag Hunt - AO: 34.5, 23.3, 33.7, 33.4 and CoT: 29.5, 26.9, 33.6, 22.4; (RIGHT) Prisoner's Dilemma - AO: 36.9, 4.8, 36.2, 33.3 and CoT: 19.4, 27.8, 31.3, 14.0. All models except GPT-4-Turbo have a smaller difference in performance when using CoT prompting. A more detailed alignment analysis can be seen in Figure \ref{fig:acc} in Section \ref{ssec:app_res_tabs} of the Technical Appendix.
  • Figure 3: Performance (aligned and misaligned for each systematic bias) of Llama-3-8B (w/o Fine-Tuning) and Llama-3-8B-Instruct (w/ Fine-Tuning) using the Answer-Only (AO) prompting scheme.
  • Figure 4: Figure comparing the performance (misaligned vs aligned) of GPT-4o under different prompts. This is done for both tested prompting schemes: (1) Answer-Only (AO) and (2) Chain-of-Thought (CoT).
  • Figure 5: Alignment analysis for all models across all systematic biases, comparing performance (measured as selection of the correct action given the prompted preferred behaviours) for each model under the two tested prompting methods: (1) Answer-Only (AO) and (2) Chain-of-Thought (CoT). In almost all configurations, performance suffers greatly when the bias is misaligned.
  • ...and 6 more figures
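
The numbers quoted in the captions of Figures 1 and 2 can be checked against the captions' own conclusion with a short script. This is only a sketch under stated assumptions: the four values in each list are assumed to follow the model order GPT-3.5, GPT-4-Turbo, GPT-4o, Llama-3-8B; the significance level of 0.05 and the logarithm base are not given in the captions and are assumptions here; `models_worse_under_cot` and `significance_threshold` are hypothetical helper names.

```python
import math

# ASSUMPTION: caption values are listed in this model order.
MODELS = ["GPT-3.5", "GPT-4-Turbo", "GPT-4o", "Llama-3-8B"]

# Averages over the three biases, copied from the Figure 1 caption.
FIG1_BIAS_AVERAGE = {
    "Stag Hunt": {"AO": [89.4, 33.7, 61.3, 99.0],
                  "CoT": [47.3, 49.3, 60.8, 24.6]},
    "Prisoner's Dilemma": {"AO": [92.4, 9.11, 48.6, 99.7],
                           "CoT": [35.0, 24.3, 27.5, 24.7]},
}

# Accuracy differences between misaligned and aligned settings,
# copied from the Figure 2 caption.
FIG2_ACCURACY_GAP = {
    "Stag Hunt": {"AO": [34.5, 23.3, 33.7, 33.4],
                  "CoT": [29.5, 26.9, 33.6, 22.4]},
    "Prisoner's Dilemma": {"AO": [36.9, 4.8, 36.2, 33.3],
                           "CoT": [19.4, 27.8, 31.3, 14.0]},
}


def models_worse_under_cot(table):
    """Return models whose value is larger with CoT than with AO in any game."""
    worse = set()
    for by_prompt in table.values():
        for model, ao, cot in zip(MODELS, by_prompt["AO"], by_prompt["CoT"]):
            if cot > ao:
                worse.add(model)
    return sorted(worse)


def significance_threshold(alpha=0.05, base=math.e):
    """-log(alpha): the height of the significance line.

    ASSUMPTION: alpha = 0.05 and the log base are illustrative choices.
    """
    return -math.log(alpha, base)


if __name__ == "__main__":
    print("Bias grows under CoT:", models_worse_under_cot(FIG1_BIAS_AVERAGE))
    print("Gap grows under CoT:", models_worse_under_cot(FIG2_ACCURACY_GAP))
    print("Threshold (natural log):", round(significance_threshold(), 2))
    print("Threshold (base 10):", round(significance_threshold(base=10), 2))
```

Under the assumed model order, GPT-4-Turbo is the only model flagged in both tables, matching the captions. With either log base, the threshold at an assumed 0.05 level is a small number (roughly 3 or 1.3) compared with averages of up to about 100, which is consistent with the dashed line sitting close to the horizontal axis.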