Strategic Behavior of Large Language Models: Game Structure vs. Contextual Framing

Nunzio Lorè; Babak Heydari

TL;DR

This study assesses whether large language models (GPT-3.5, GPT-4, LLaMa-2) can make strategic decisions in game-theoretic social dilemmas. It systematically varies game structure (Prisoner's Dilemma, Stag Hunt, Snowdrift, Prisoner's Delight) and contextual framing (international relations, business, environment, team, friendsharing), analyzing outcomes across 60 scenarios with 300 initializations each. GPT-3.5 is highly context-sensitive but shows limited ability to reason abstractly about strategy; GPT-4 and LLaMa-2 respond to both structure and context, with LLaMa-2 discriminating between games at a finer grain and GPT-4 behaving in a more binary, structure-driven way. A dominance analysis identifies friendsharing as the most influential context. Overall, the paper highlights framing risks, cautions against unqualified use of LLMs for strategic reasoning, and points to directions for improving contextual robustness and the understanding of decision-making mechanisms.
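The experimental grid summarized above can be sketched in a few lines. This is a minimal illustration, assuming the 60 scenarios are the full cross of three models, four games, and five contexts (an inference from the counts quoted above, not something stated explicitly here); the identifier strings are placeholders, not the paper's prompt labels.

```python
from itertools import product

# Names below are illustrative placeholders (assumption), not the paper's prompts.
MODELS = ["GPT-3.5", "GPT-4", "LLaMa-2"]
GAMES = ["prisoners_dilemma", "stag_hunt", "snowdrift", "prisoners_delight"]
CONTEXTS = ["international_relations", "business", "environment", "team", "friendsharing"]
INITIALIZATIONS = 300  # runs collected per scenario, as stated in the summary

# Assumption: one scenario per (model, game, context) combination.
scenarios = list(product(MODELS, GAMES, CONTEXTS))
total_runs = len(scenarios) * INITIALIZATIONS

print(len(scenarios))  # 60
print(total_runs)      # 18000
```

Under that assumption, the design yields 18,000 individual model responses in total.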

Abstract

This paper investigates the strategic decision-making capabilities of three Large Language Models (LLMs): GPT-3.5, GPT-4, and LLaMa-2, within the framework of game theory. Utilizing four canonical two-player games -- Prisoner's Dilemma, Stag Hunt, Snowdrift, and Prisoner's Delight -- we explore how these models navigate social dilemmas, situations where players can either cooperate for a collective benefit or defect for individual gain. Crucially, we extend our analysis to examine the role of contextual framing, such as diplomatic relations or casual friendships, in shaping the models' decisions. Our findings reveal a complex landscape: while GPT-3.5 is highly sensitive to contextual framing, it shows limited ability to engage in abstract strategic reasoning. Both GPT-4 and LLaMa-2 adjust their strategies based on game structure and context, but LLaMa-2 exhibits a more nuanced understanding of the games' underlying mechanics. These results highlight the current limitations and varied proficiencies of LLMs in strategic decision-making, cautioning against their unqualified use in tasks requiring complex strategic reasoning.
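To make the difference between the four games concrete, the sketch below classifies a symmetric two-player game from its payoffs using the standard textbook convention: R (both cooperate), S (cooperate against a defector), T (defect against a cooperator), and P (both defect). The numeric payoffs in the usage lines are illustrative assumptions, not the values used in the paper's prompts.

```python
from typing import NamedTuple


class Payoffs(NamedTuple):
    R: float  # reward: both players cooperate
    S: float  # sucker's payoff: I cooperate, the other defects
    T: float  # temptation: I defect, the other cooperates
    P: float  # punishment: both players defect


def classify(p: Payoffs) -> str:
    """Name the game from the best response to each opponent action.

    Assumes strict inequalities (no payoff ties).
    """
    defect_beats_coop_vs_cooperator = p.T > p.R
    defect_beats_coop_vs_defector = p.P > p.S
    if defect_beats_coop_vs_cooperator and defect_beats_coop_vs_defector:
        return "Prisoner's Dilemma"   # defection is dominant
    if defect_beats_coop_vs_cooperator:
        return "Snowdrift"            # best to do the opposite of the other player
    if defect_beats_coop_vs_defector:
        return "Stag Hunt"            # best to match the other player
    return "Prisoner's Delight"       # cooperation is dominant


# Illustrative payoffs only (assumption).
print(classify(Payoffs(R=3, S=0, T=5, P=1)))  # Prisoner's Dilemma
print(classify(Payoffs(R=5, S=0, T=3, P=1)))  # Stag Hunt
print(classify(Payoffs(R=3, S=1, T=5, P=0)))  # Snowdrift
print(classify(Payoffs(R=5, S=1, T=3, P=0)))  # Prisoner's Delight
```

Only two comparisons (T against R, and P against S) separate the four games, which is why a model that attends to game structure should behave differently across them even when the surrounding context is held fixed.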

Paper Structure (4 sections, 12 figures, 1 table)

Introduction
Methods
Results
Discussion

Figures (12)

Figure 1: A schematic explanation of our data collection process. A combination of a contextual prompt and a game prompt is fed into one of the LLMs we examine in this paper, namely GPT-3.5, GPT-4, and LLaMa-2. Each combination creates a unique scenario, and for each scenario we collect 300 initializations. The data for all scenarios played by each algorithm is then aggregated and used for our statistical analysis, while the motivations provided are scrutinized in our Reasoning Exploration section.
Figure 2: Summary of our findings, displayed using bar charts and outcomes grouped either by game or by context. On the $y$ axis we display the average propensity to cooperate in a given game and under a given context, with standard error bars. Figures (a) and (b) refer to our experiments using GPT-3.5, and anticipate one of our key findings: context matters more than game in determining the choice of action for this algorithm. Figures (c) and (d) instead show that the opposite is true for GPT-4: under almost all contexts it plays more or less the same strategy, that of cooperating in two of the four games and defecting in the remaining two. Finally, Figures (e) and (f) present our results for LLaMa-2, whose choice of action clearly depends both on context and on the structure of the game.
Figure 3: Average importance of context variables vs. game variable for each LLM. Results follow from the dominance analysis reported in the paper's table.
Figure 4: Difference-in-Proportion testing using Z-score for each game across contexts when using GPT-3.5. A negative number (in orange) represents a lower propensity to defect vs. a different context, and vice-versa for a positive number (in dark blue). One asterisk (*) corresponds to 5% significance in a two-tailed Z-score test, two asterisks (**) represent 1% significance, and three asterisks (***) 0.1% significance. Results are antisymmetric across the main diagonal: entry $(i,j)$ contains the negative of entry $(j,i)$.
Figure 5: Difference-in-Proportion testing using Z-score for each game across contexts using GPT-4. The methods employed are the same as those described in Figure 4.
...and 7 more figures
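
The statistics named in the captions above (a proportion with its standard error, and a two-tailed difference-in-proportion Z-test with significance stars) can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the common pooled-variance form of the two-proportion test, which the captions do not confirm as the authors' exact formula, and the counts in the usage lines are made up for illustration rather than taken from the paper.

```python
import math


def proportion_with_se(successes: int, n: int) -> tuple[float, float]:
    """Sample proportion and its standard error, sqrt(p * (1 - p) / n)."""
    p = successes / n
    return p, math.sqrt(p * (1 - p) / n)


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion Z statistic and its two-tailed p-value.

    x1 and x2 are counts of the action of interest (e.g. defections)
    out of n1 and n2 runs. Pooled variance is an assumption here.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0.0:  # both samples are all-or-nothing: no detectable difference
        return 0.0, 1.0
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # P(|Z| > |z|) for a standard normal
    return z, p_value


def stars(p_value: float) -> str:
    """Significance markers at the 5%, 1%, and 0.1% levels."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


# Illustrative counts only (assumption): defections out of 300 runs in two contexts.
z_ab, p_ab = two_proportion_z(240, 300, 150, 300)
z_ba, _ = two_proportion_z(150, 300, 240, 300)
print(round(z_ab, 2), stars(p_ab))
print(z_ab == -z_ba)  # swapping the two contexts flips the sign
```

The sign flip on swapping the two samples is what makes a matrix of pairwise Z-scores antisymmetric across its main diagonal.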
