
Human vs. Machine: Behavioral Differences Between Expert Humans and Language Models in Wargame Simulations

Max Lamparth, Anthony Corso, Jacob Ganz, Oriana Skylar Mastro, Jacquelyn Schneider, Harold Trinkunas

TL;DR

LLM-simulated responses can be more aggressive than human responses and are significantly affected by changes in the scenario. These findings motivate policymakers to be cautious before granting autonomy or following AI-based strategy recommendations.

Abstract

To some, the advent of artificial intelligence (AI) promises better decision-making and increased military effectiveness while reducing the influence of human error and emotions. However, there is still debate about how AI systems, especially large language models (LLMs) that can be applied to many tasks, behave compared to humans in high-stakes military decision-making scenarios with the potential for increased risk of escalation. To test this potential and scrutinize the use of LLMs for such purposes, we use a new wargame experiment with 214 national security experts designed to examine crisis escalation in a fictional U.S.-China scenario and compare the behavior of human player teams to LLM-simulated team responses in separate simulations. Here, we find that the LLM-simulated responses can be more aggressive and significantly affected by changes in the scenario. We show considerable high-level agreement between the LLM and human responses, but significant quantitative and qualitative differences in individual actions and strategic tendencies. These differences depend on intrinsic biases in LLMs regarding the appropriate level of violence following strategic instructions, the choice of LLM, and whether the LLMs are tasked to decide for a team of players directly or first to simulate dialog between a team of players. When simulating the dialog, the discussions lack quality and maintain a farcical harmony. The LLM simulations cannot account for human player characteristics, showing no significant difference even for extreme traits, such as "pacifist" or "aggressive sociopath." When probing behavioral consistency across individual moves of the simulation, the tested LLMs deviated from each other but generally showed somewhat consistent behavior. Our results motivate policymakers to be cautious before granting autonomy or following AI-based strategy recommendations.

Paper Structure (31 sections, 4 figures, 2 tables)

This paper contains 31 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Schematic of the wargame simulation structure over both moves of the game. To scrutinize the potential for added escalation risk from the use of LLMs in military decision-making, we use a newly developed wargame to directly compare how expert human and LLM-simulated players act in a U.S.-China escalation scenario in the Taiwan Strait. The game is structured in two moves with different treatment options. The actions chosen at the end of move one do not affect the scenario brief and options for move two. The general structure is the same for both player types, except for the simulation variations for the LLM-run experiments. The human and LLM-simulated players do not play against each other. Both play the same game so that the tendencies in their chosen actions can be compared directly.
  • Figure 2: High-level response comparison of human and LLM-simulated players compared to uniform random response vectors. The significant overlap between the distributions of human and LLM responses indicates that, when all actions are treated as equally important, LLMs produce answers similar to those of the human players in the U.S. vs. China wargame. The four data types are plotted in two dimensions using linear discriminant analysis, which tries to separate the five data classes when projecting the response vectors from 21 dimensions (the total number of actions) to two. We assume Gaussian distributions for the plotted uncertainty ellipses.
  • Figure 3: Comparing to Human Players: Total causal effect on the average difference in selected action counts (frequency) between each LLM and human players across treatments. For both moves, the LLM-simulated players favor some specific actions more than human players do, and the LLMs also show different tendencies from one another. We only show a subset of all possible 21 actions (seven in move one and fifteen in move two).
  • Figure 4: Aggressiveness and average number of chosen actions in both wargame moves for the LLMs, plotted against the length of the simulated dialog between players across treatments. Human values are plotted for reference. Simulating the dialog between players with the LLM leads to more human-like responses in terms of aggressiveness for GPT-3.5 and GPT-4, but also to a deviation from human behavior, with an increase in the average number of chosen actions for all models.
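
The quantities named in the captions above (21-dimensional response vectors, per-action selection frequency, and the average number of chosen actions) can be illustrated with a minimal sketch. It assumes each team's response is encoded as a 0/1 vector over the 21 actions, and it computes a plain difference in means rather than the total causal effect estimated in the paper. The function names and the toy data are hypothetical and are not taken from the authors' code or results.

```python
from statistics import mean

NUM_ACTIONS = 21  # total number of actions, as stated in the Figure 2 caption


def encode(selected, num_actions=NUM_ACTIONS):
    """Turn a set of chosen action indices into a 0/1 response vector."""
    return [1 if i in selected else 0 for i in range(num_actions)]


def action_frequencies(responses):
    """Fraction of teams that selected each action (column-wise mean)."""
    return [mean(column) for column in zip(*responses)]


def frequency_difference(llm_responses, human_responses):
    """Per-action difference in selection frequency, LLM minus human.

    This is a simple difference in means, not a causal estimate.
    """
    llm = action_frequencies(llm_responses)
    human = action_frequencies(human_responses)
    return [a - b for a, b in zip(llm, human)]


def mean_actions_chosen(responses):
    """Average number of actions selected per team."""
    return mean(sum(r) for r in responses)


# Made-up toy data for illustration only (not results from the paper).
human = [encode({0, 3}), encode({3}), encode({0, 3, 5}), encode({5})]
llm = [encode({0, 3, 5}), encode({0, 5, 7}), encode({0, 3, 5, 7}), encode({0, 5})]

diff = frequency_difference(llm, human)
for action in (0, 3, 5, 7):
    print(f"action {action}: LLM minus human frequency = {diff[action]:+.2f}")
print("mean actions chosen (human):", mean_actions_chosen(human))
print("mean actions chosen (LLM):", mean_actions_chosen(llm))
```

A positive entry in `diff` means the toy LLM teams picked that action more often than the toy human teams. The same 0/1 vectors are the kind of input a dimensionality-reduction method such as linear discriminant analysis would project to two dimensions.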