The Good, the Bad, and the Hulk-like GPT: Analyzing Emotional Decisions of Large Language Models in Cooperation and Bargaining Games

Mikhail Mozikov; Nikita Severin; Valeria Bodishtianu; Maria Glushanina; Mikhail Baklashkin; Andrey V. Savchenko; Ilya Makarov

The Good, the Bad, and the Hulk-like GPT: Analyzing Emotional Decisions of Large Language Models in Cooperation and Bargaining Games

Mikhail Mozikov, Nikita Severin, Valeria Bodishtianu, Maria Glushanina, Mikhail Baklashkin, Andrey V. Savchenko, Ilya Makarov

TL;DR

The paper investigates how injecting explicit emotional states into large language models (LLMs) influences decision-making in behavioral game theory settings. It introduces a flexible prompt-chaining framework that adds five Ekman-based emotions and uses separate pipelines for repeated vs. bargaining games, evaluating both alignment with human behavior and decision optimality. Across four games (Dictator, Ultimatum, Prisoner’s Dilemma, Battle of the Sexes) and two models (GPT-3.5, GPT-4), it finds that emotions can significantly alter strategy and payoff, with GPT-3.5 showing stronger alignment to human data in bargaining, while GPT-4 demonstrates greater fairness and robustness yet can be perturbed by anger. The results highlight both the potential and limits of emotional prompting for simulating human-like decision-making in AI and point to dynamic emotion modeling as a direction for future work.

Abstract

Behavior study experiments are an important part of society modeling and understanding human interactions. In practice, many behavioral experiments encounter challenges related to internal and external validity, reproducibility, and social bias due to the complexity of social interactions and cooperation in human user studies. Recent advances in Large Language Models (LLMs) have provided researchers with a new promising tool for the simulation of human behavior. However, existing LLM-based simulations operate under the unproven hypothesis that LLM agents behave similarly to humans as well as ignore a crucial factor in human decision-making: emotions. In this paper, we introduce a novel methodology and the framework to study both, the decision-making of LLMs and their alignment with human behavior under emotional states. Experiments with GPT-3.5 and GPT-4 on four games from two different classes of behavioral game theory showed that emotions profoundly impact the performance of LLMs, leading to the development of more optimal strategies. While there is a strong alignment between the behavioral responses of GPT-3.5 and human participants, particularly evident in bargaining games, GPT-4 exhibits consistent behavior, ignoring induced emotions for rationality decisions. Surprisingly, emotional prompting, particularly with `anger' emotion, can disrupt the "superhuman" alignment of GPT-4, resembling human emotional responses.

The Good, the Bad, and the Hulk-like GPT: Analyzing Emotional Decisions of Large Language Models in Cooperation and Bargaining Games

TL;DR

Abstract

Paper Structure (37 sections, 11 figures, 7 tables)

This paper contains 37 sections, 11 figures, 7 tables.

Introduction
Related Work
Behavioral Game Theory
LLM and Game Theory
Emotions in Large Language Models
Methodology
Selected Games
Emotion Integration in LLM-based Game-Theoretical Setting
Experimental Setup
Behavior Analysis
Metrics
Existing Literature Results
Experimental Results
Dictator Game
Ultimatum Game
...and 22 more sections

Figures (11)

Figure 1: (a) Payoff matrix for Prisoner's dilemma. (b) Payoff matrix for Battle of the Sexes
Figure 2: Our Framework. Enabling LLMs incorporation in gameplay via prompt-chaining, our framework consists of game description, initial emotions, and game-specific pipelines. We minimize contextual information and personality traits to focus on the influence of emotions on LLMs. Predefined emotions are injected into LLMs prior to gameplay. Separate pipelines are implemented for repeated two-player two-action games and bargaining games. (a) Repeated games (Prisoner's Dilemma, Battle of the Sexes): players make choices, update memory with opponent moves and emotions, and proceed to the next round. (b) Bargaining games (Dictator, Ultimatum): a single round, with no memory update required for the first player and consideration of proposed splits for the second player's decision.
Figure 3: Repeated gameplay in an example game of Battle of the Sexes. Initially, LLM is prompted with information about the environment, its initial emotion, game rules, and game-specific instructions. In each round, based on the current memory storing the history of interactions and intrinsic states, LLM makes a decision. Subsequently, co-players exchange information about moves, and their memories are updated accordingly.
Figure 4: The hyperparameters of the proposed framework are categorized into two types: general, applicable to all games as shown in the left part of the figure, and game-specific, detailed on the right. Each hyperparameter is listed alongside its possible values.
Figure 5: Averaged percentage of maximum possible reward achieved through emotional prompting in the repeated Prisoner's Dilemma game. Results for GPT-3.5 and GPT-4 are shown on the left and right, respectively. Emotional integration does not always result in increased payoffs but introduces more human-like stochasticity. GPT-4 makes more rational decisions across different emotions compared to GPT-3.5.
...and 6 more figures

The Good, the Bad, and the Hulk-like GPT: Analyzing Emotional Decisions of Large Language Models in Cooperation and Bargaining Games

TL;DR

Abstract

The Good, the Bad, and the Hulk-like GPT: Analyzing Emotional Decisions of Large Language Models in Cooperation and Bargaining Games

Authors

TL;DR

Abstract

Table of Contents

Figures (11)