Discovering Differences in Strategic Behavior Between Humans and LLMs

Caroline Wang; Daniel Kasenberg; Kim Stachenfeld; Pablo Samuel Castro

TL;DR

This work investigates how frontier LLMs differ from humans in strategic play within Iterated Rock-Paper-Scissors (IRPS) by automatically discovering interpretable behavioral models with AlphaEvolve. It combines large-scale human and matched-LLM datasets against a suite of bots and compares against baseline symbolic and neural models. The main contributions are (1) demonstrating higher, earlier exploitative performance by frontier LLMs, (2) showing AlphaEvolve can produce simple, interpretable programs that match predictive performance and reveal that frontier LLMs maintain more complex opponent models, and (3) providing a framework to analyze structural differences between human and LLM strategic behavior that extends beyond modality-specific statistics. The findings illuminate how LLMs may surpass humans in strategic reasoning in IRPS while highlighting limitations in long-horizon reasoning for certain models, informing both behavioral science and AI governance in interactive settings.

Abstract

As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provides a framework for analyzing behavior, existing models do not fully capture the idiosyncratic behavior of humans or black-box, non-human agents like LLMs. We employ AlphaEvolve, a cutting-edge program discovery tool, to directly discover interpretable models of human and LLM behavior from data, thereby enabling open-ended discovery of structural factors driving human and LLM behavior. Our analysis on iterated rock-paper-scissors reveals that frontier LLMs can be capable of deeper strategic behavior than humans. These results provide a foundation for understanding structural differences driving differences in human and LLM behavior in strategic interactions.

Paper Structure (53 sections, 2 equations, 11 figures, 4 tables, 2 algorithms)

Introduction
IRPS Datasets
IRPS Game and Bots
Human Dataset
Matched LLM Datasets
Methods
Problem Formulation: Behavior Modeling in IRPS
Evolving Behavioral Models using AlphaEvolve
Baseline Behavioral Models
Nash Equilibrium
Contextual Sophisticated Experience-Weighted Attraction (CS-EWA)
Recurrent Neural Network (RNN)
Comparing Human and LLM Strategic Behavior Using AlphaEvolve
Win Rates of Humans and LLMs
AlphaEvolve Models Improve Over Baselines
...and 38 more sections

Figures (11)

Figure 1: Diagrams of AlphaEvolve-discovered programs. The simplest-but-best programs for humans (left) and Gemini 2.5 Pro (right), discovered by AlphaEvolve on iterated rock-paper-scissors, are displayed. For simplicity, the learnable parameters $\theta$ of each program are omitted. Both programs use value-based learning and opponent modeling. The Gemini 2.5 Pro program displays more sophisticated opponent modeling than humans and considers counterfactual outcomes during value updates.
Figure 2: Win rates against nonadaptive bots. Bots are colored from least to most complex. The top row displays the aggregate win rate over all bots while the bottom row displays the win rate over time. Mean and 95% CIs are displayed, with some CIs not visible due to marker size.
Figure 3: AlphaEvolve improves over baseline behavioral models. The improvement in per-game normalized likelihood over the Nash baseline (which achieves a likelihood of $1/3$ on all datasets) is shown for each dataset and method. The mean and 95% confidence interval are displayed in the error bars.
Figure 4: Pareto Frontier of AlphaEvolve Programs. The AlphaEvolve program evaluated in Fig. \ref{fig:leaderboard} is denoted by the blue star, while the simplest-but-best program is indicated by the yellow star. Program simplicity is measured by the negative Halstead effort.
Figure 5: Cross-generalization of SBB Programs to each dataset. Cross-generalization matrix displaying the evaluation likelihood of each simplest-but-best (SBB) program on all datasets, with 95% CIs.
...and 6 more figures
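
The Figure 3 caption states that the Nash baseline achieves a normalized likelihood of $1/3$ on all datasets. The sketch below illustrates why, assuming that "normalized likelihood" means the geometric mean of the probabilities a model assigns to the observed actions in a game; that definition, the function name `normalized_likelihood`, the round count, and the example probabilities are all assumptions for illustration and are not taken from the paper.

```python
import math


def normalized_likelihood(probs):
    """Geometric mean of the probabilities assigned to the observed actions.

    Assumed definition: exp of the mean log-probability over one game's rounds.
    """
    if not probs:
        raise ValueError("need at least one round")
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


# A uniform (Nash) policy assigns 1/3 to whichever action was actually played,
# so its normalized likelihood is 1/3 regardless of the data.
n_rounds = 10  # hypothetical game length, chosen only for this example
nash = normalized_likelihood([1 / 3] * n_rounds)

# Hypothetical per-round probabilities from some fitted behavioral model.
model_probs = [0.5, 0.4, 0.6, 0.3]
model = normalized_likelihood(model_probs)

# Improvement over the Nash baseline, the kind of quantity Figure 3 reports.
improvement = model - normalized_likelihood([1 / 3] * len(model_probs))

print(f"Nash baseline: {nash:.4f}")
print(f"Model: {model:.4f}, improvement over Nash: {improvement:.4f}")
```

Under this assumed definition, any model that predicts the observed actions better than chance on average scores above $1/3$, which is why the baseline is a natural zero point for comparing methods.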
