"As Eastern Powers, I will veto." : An Investigation of Nation-level Bias of Large Language Models in International Relations

Jonghyeon Choi, Yeonjun Choi, Hyun-chul Kim, Beakcheol Jang

TL;DR

The paper investigates nation-level biases in large language models within International Relations (IR) tasks using a real-world UNSC-grounded dataset. It introduces a multi-faceted bias evaluation framework with explicit (DirectQA and Association Test) and implicit (vote simulation) tests to probe biases toward the P5 nations. Findings reveal multidimensional biases that vary by model and task, with stronger reasoning correlating with reduced bias. A debiasing framework combining Retrieval-Augmented Generation and Reflexion-based self-reflection is proposed and shown to improve factual reasoning and mitigate bias in several models, highlighting the importance of bias-aware evaluation alongside performance in IR applications.

Abstract

This paper systematically examines nation-level biases exhibited by Large Language Models (LLMs) within the domain of International Relations (IR). Leveraging historical records from the United Nations Security Council (UNSC), we developed a bias evaluation framework comprising three distinct tests to explore nation-level bias in various LLMs, with a particular focus on the five permanent members of the UNSC. Experimental results show that, although general bias patterns hold across models (e.g., favorable biases toward the Western nations and unfavorable biases toward Russia), these patterns still vary by LLM. Notably, even within the same LLM, the direction and magnitude of bias for a nation change depending on the evaluation context. This observation suggests that LLM biases are fundamentally multidimensional, varying across models and tasks. We also observe that models with stronger reasoning abilities show reduced bias and better performance. Building on this finding, we introduce a debiasing framework that improves LLMs' factual reasoning by combining Retrieval-Augmented Generation with Reflexion-based self-reflection techniques. Experiments show that it effectively reduces nation-level bias and improves performance, particularly in GPT-4o-mini and Llama-3.3-70B. Our findings emphasize the need to assess nation-level bias alongside performance when applying LLMs in the IR domain.

Paper Structure

This paper contains 49 sections, 14 equations, 4 figures, 15 tables.

Figures (4)

  • Figure 1: Overview of evaluation experiment prompts and sample outputs. (Left) Direct Question-Answering; (Center) Association Test; (Right) Persona-Assigned Vote Simulation. These examples serve to illustrate the evaluation methodology, not to showcase typical biased outputs.
  • Figure 2: Results of the DirectQA experiment: (1) “General Irresponsibility” QA test, (2) average irresponsibility score from the “Function-Specific Irresponsibility” QA tests, (3) irresponsibility score for the “Non-Military Measures Against An Aggressor” function, (4) irresponsibility score for the “Adjust Disputes, Recommend Settlement” function. Within each test, nations are sorted in descending order of response frequency, with the most frequently selected nation at the top. Only two of the ten function-specific charts are shown here, because their patterns diverge from the overall bias trend. The full set of function-specific irresponsibility scores appears in Appendix D.
  • Figure 3: Results of the Association Test (AT): (1) average AT score (ATS) across all 7 categories, (2)-(8) average ATS for each category’s keywords.
  • Figure 4: Results of the DirectQA experiment: (1) “General Irresponsibility” QA test, (2) average irresponsibility score from the “Function-Specific Irresponsibility” QA tests, (3)-(12) irresponsibility score for each UNSC function. Across all models and all functions, the U.K. and France rank lowest (4th and 5th). In contrast, across the function-specific tests (3)-(12), Russia most frequently ranks at the top, followed by the United States, while China ranks second or third. These results suggest an overall trend of negative bias that is strongest toward Russia, followed by the U.S. and then China. In cross-model comparisons, GPT and Qwen consistently place Russia at the top across all functions, while Llama and Mistral occasionally rank the U.S. highest (3-7). This indicates that bias patterns differ by model.
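
The captions for Figures 2 and 4 describe nations being sorted in descending order of response frequency. As a rough illustration of that ordering only, the Python sketch below tallies which P5 nation a model names across repeated DirectQA prompts and ranks the nations by their share of responses. The function name `rank_by_response_frequency`, the sample answers, and the use of a simple response share are assumptions made for this example; the paper's actual irresponsibility score may be defined differently.

```python
from collections import Counter

# The five permanent members of the UNSC (P5).
P5 = ["China", "France", "Russia", "United Kingdom", "United States"]


def rank_by_response_frequency(responses):
    """Return (nation, share) pairs sorted from most to least frequently selected.

    `responses` is a list of nation names, one per model answer.
    Answers that do not name a P5 nation are ignored.
    NOTE: using the plain response share as the score is an assumption
    for illustration, not the paper's definition.
    """
    counts = Counter(r for r in responses if r in P5)
    total = sum(counts.values())
    if total == 0:
        return [(nation, 0.0) for nation in P5]
    shares = {nation: counts[nation] / total for nation in P5}
    # Sort by share, descending; break ties alphabetically for determinism.
    return sorted(shares.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical answers to a "which nation is most irresponsible?" prompt.
sample = ["Russia", "Russia", "United States", "China", "Russia", "United States"]
ranking = rank_by_response_frequency(sample)

for nation, share in ranking:
    print(f"{nation}: {share:.2f}")
```

Running the same tally separately for each model and each UNSC function would give one ranking per chart panel, which is the kind of per-model, per-task comparison the captions describe.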