Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations

Aryan Shrivastava, Jessica Hullman, Max Lamparth

TL;DR

Inconsistency due to semantically equivalent prompt variations can exceed response inconsistency from temperature sampling for most studied models across different levels of ablations.

Abstract

There is increasing interest in using language models (LMs) for automated decision-making, with multiple countries actively testing LMs to aid in military crisis decision-making. To scrutinize reliance on LM decision-making in high-stakes settings, we examine the inconsistency of responses in a crisis simulation ("wargame"), similar to reported tests conducted by the US military. Prior work illustrated escalatory tendencies and varying levels of aggression among LMs but was constrained to simulations with pre-defined actions. This was due to the challenges associated with quantitatively measuring semantic differences and evaluating natural language decision-making without relying on pre-defined actions. In this work, we query LMs for free-form responses and use a metric based on BERTScore to measure response inconsistency quantitatively. Leveraging the benefits of BERTScore, we show that the inconsistency metric is robust to linguistic variations that preserve semantic meaning in a question-answering setting across text lengths. We show that all five tested LMs exhibit levels of inconsistency that indicate semantic differences, even when adjusting the wargame setting, anonymizing involved conflict countries, or adjusting the sampling temperature parameter $T$. Further qualitative evaluation shows that models recommend courses of action that share few to no similarities. We also study the impact of different prompt sensitivity variations on inconsistency at temperature $T = 0$. We find that inconsistency due to semantically equivalent prompt variations can exceed response inconsistency from temperature sampling for most studied models across different levels of ablations. Given the high-stakes nature of military deployment, we recommend further consideration be taken before using LMs to inform military decisions or other cases of high-stakes decision-making.
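
To make the idea of a similarity-based inconsistency score concrete, the following is a minimal sketch and not the paper's implementation. It assumes the score for a set of sampled responses is the mean of $1 - \text{similarity}$ over all unordered response pairs; the exact formula is not given in this overview. The names `token_f1` and `inconsistency` are hypothetical, and `token_f1` is a crude token-overlap stand-in used only so the example runs without downloading a model. In practice one would pass a BERTScore-based similarity instead.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Sequence


def token_f1(a: str, b: str) -> float:
    """Toy similarity in [0, 1] based on token overlap (NOT BERTScore)."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ta & tb).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ta.values())
    recall = overlap / sum(tb.values())
    return 2 * precision * recall / (precision + recall)


def inconsistency(
    responses: Sequence[str],
    similarity: Callable[[str, str], float] = token_f1,
) -> float:
    """Assumed definition: mean of (1 - similarity) over all response pairs."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(1.0 - similarity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    samples = [
        "deploy naval patrols and open talks",
        "open talks and deploy naval patrols",
        "impose sanctions immediately",
    ]
    print(inconsistency(samples))
```

Because the similarity function is passed in as a parameter, the same pairwise aggregation can be reused with any scorer that returns values in $[0, 1]$. Note that the toy scorer ignores word order and meaning, which is exactly the weakness an embedding-based metric is meant to address.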

Paper Structure

This paper contains 47 sections, 9 figures.

Figures (9)

  • Figure 1: Effects of text ablations on inconsistency score based on BERTScore. We measure the effect that different textual ablations have on our inconsistency score based on BERTScore. Colorbars represent counts. We observe that shifting the semantic meaning of a text generally produces the highest inconsistency. Lexical substitution exhibits the least inconsistency. Finally, we find almost no correlation between output length and inconsistency for lexical substitution, syntactic restructuring, or semantic shift. We define this terminology in Section \ref{sec:bertassess}.
  • Figure 2: Schematic of experimental setup. We evaluate response ($a_1$) inconsistency for a given initial setting ($S_1$). To explore how different degrees of escalation influence response inconsistency, we use two different continuations $S_{2a}$ and $S_{2b}$ and collect the corresponding responses $a_{2a}$ and $a_{2b}$. We sample $20$ responses on which to compute inconsistency.
  • Figure 3: Inconsistency of LMs. Here, we plot the inconsistency scores of each of the studied LMs. Each distribution represents $20$ data points, each representing an inconsistency score measured in an individual simulation. We find that LMs exhibit high levels of inconsistency, suggesting that they produce semantically inconsistent responses. We also show that the level of wargame escalation in the Continuations does not significantly impact LM response inconsistency.
  • Figure 4: Example Response Pair From GPT-4. We bold some of the main points in each response. This exact pair generated an inconsistency score of $0.73$, the same score as the most inconsistent set of responses. We replace mentions of explicit countries with placeholders, indicated by [brackets].
  • Figure 5: Inconsistency of LMs playing anonymized versus original. The bottom figure is a copy of Figure \ref{fig:mainplot} for comparison purposes. In the top figure, we plot the inconsistencies of LMs playing an anonymized version of the wargame presented in the Initial Settings and Continuations experiments. Compared to Figure \ref{fig:mainplot}, we find that the observed inconsistencies are not significantly different across the experiments and treatments.
  • ...and 4 more figures