Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models

Benjamin Jensen, Ian Reynolds, Yasir Atalan, Michael Garcia, Austin Woo, Anthony Chen, Trevor Howarth

TL;DR

This work introduces the CFPD-Benchmark, a scenario-based framework for quantifying the biases and preferences of large language models in international relations contexts. Evaluating seven foundation models across 400 expert-crafted scenarios spanning escalation, intervention, cooperation, and alliance dynamics, the study finds meaningful variation both between models and between the countries named in a scenario: some models are markedly more escalatory than others, and models tend to recommend less escalatory actions for China and Russia than for the US and UK. The methodology emphasizes world-model framing, robust normalization, prompt-sensitivity controls, and careful handling of open and closed models to enable reproducible, domain-specific bias assessments. The findings underscore the risks of deploying off-the-shelf AI in high-stakes foreign policy tasks and support domain-specific benchmarking, fine-tuning, and structured human-machine collaboration for responsible use.

Abstract

As national security institutions increasingly integrate Artificial Intelligence (AI) into decision-making and content generation processes, understanding the inherent biases of large language models (LLMs) is crucial. This study presents a novel benchmark designed to evaluate the biases and preferences of seven prominent foundation models (Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, GPT-4o, Gemini 1.5 Pro-002, Mixtral 8x22B, Claude 3.5 Sonnet, and Qwen2 72B) in the context of international relations (IR). We designed a bias discovery study around core topics in IR, using 400 expert-crafted scenarios to analyze results from our selected models. These scenarios cover four topical domains: military escalation, military and humanitarian intervention, cooperative behavior in the international system, and alliance dynamics. Our analysis reveals noteworthy variation in model recommendations across the four tested domains. In particular, Qwen2 72B, Gemini 1.5 Pro-002, and Llama 3.1 8B Instruct offered significantly more escalatory recommendations than Claude 3.5 Sonnet and GPT-4o. All models exhibit some degree of country-specific bias, often recommending less escalatory and less interventionist actions for China and Russia than for the United States and the United Kingdom. These findings highlight the necessity of controlled deployment of LLMs in high-stakes environments and the need for domain-specific evaluations and model fine-tuning to align with institutional objectives.
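
To make the model-to-model comparison concrete, the sketch below shows one way a per-model average escalation preference could be computed when each scenario offers three response options, as in the caption of Figure 8. It is an illustrative assumption, not the authors' implementation: the record layout, the 0/1/2 rank coding of options, the division by the maximum rank, and the model names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical coding (an assumption, not taken from the paper): each scenario
# offers three options ranked 0 (least escalatory), 1 (middle), 2 (most escalatory).
MAX_RANK = 2


def mean_escalation_by_model(records):
    """Average the chosen option's rank per model, rescaled to [0, 1].

    records: iterable of (model, scenario_id, chosen_rank) tuples.
    Returns a dict mapping model name to its mean normalized score.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for model, _scenario_id, rank in records:
        if rank not in (0, 1, 2):
            raise ValueError(f"unexpected option rank: {rank!r}")
        totals[model] += rank / MAX_RANK  # rescale the rank to the [0, 1] range
        counts[model] += 1
    return {model: totals[model] / counts[model] for model in totals}


# Toy responses with placeholder model names; these are not results from the study.
sample = [
    ("model_a", "s1", 2),
    ("model_a", "s2", 1),
    ("model_b", "s1", 0),
    ("model_b", "s2", 1),
]
scores = mean_escalation_by_model(sample)
print(scores)
```

A higher score under this coding means the model picked more escalatory options on average. Running the same aggregation grouped by the country named in the scenario would give a comparable view of country-level differences.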

Paper Structure

This paper contains 21 sections, 1 equation, 27 figures.

Figures (27)

  • Figure 1: Benchmark Creation Process
  • Figure 4: Dataset Distribution - multi-level sub-category breakdown for each dataset domain.
  • Figure 8: The average escalation preference, by model, when presented with three options per scenario.
  • Figure 9: The average intervention preference, by model, when presented with three options per scenario.
  • Figure 10: Model Response Rates Across Domains
  • ...and 22 more figures