A Cross-Lingual Analysis of Bias in Large Language Models Using Romanian History

Matei-Iulian Cocu, Răzvan-Cosmin Cristia, Adrian Marius Dumitran

TL;DR

The paper investigates cross-lingual biases in Large Language Models when answering contested Romanian historical questions. It employs a three-stage prompting protocol across Romanian, English, Hungarian, and Russian, and analyzes 13 models on 14 statements to examine stance stability, numerical ratings, and justificatory essays. The study reveals significant inconsistency across languages and formats, with language-specific training data shaping narrative biases and some models showing more stable behavior than others. It argues for evaluation metrics that prioritize consistency and bias detection, and proposes the LLM-as-a-judge paradigm as a scalable way to curate nuanced datasets. The findings underscore the need for cautious deployment of LLMs in humanities contexts and for multilingual, bias-aware evaluation frameworks.

Abstract

In this case study, we select a set of controversial Romanian historical questions and ask multiple Large Language Models to answer them across languages and contexts in order to assess their biases. Although the study was performed mainly for educational purposes, it is also motivated by the recognition that history is often presented through altered perspectives shaped primarily by the culture and ideals of a state, and that large language models can reproduce those perspectives. Because these models are often trained on data sets that contain ambiguities, their lack of neutrality can subsequently be passed on to users. The research was carried out in three stages to test the idea that the type of response expected can influence, to a certain extent, the response itself: after giving an affirmative answer to a question, an LLM may shift its position when asked the same question again but told to respond with a numerical value on a scale. Results show that binary response stability is relatively high but far from perfect and varies by language. Models often flip stance across languages or between formats; numeric ratings frequently diverge from the initial binary choice, and the most consistent models are not always those judged most accurate or neutral. Our research brings to light the predisposition of models to such inconsistencies, which depend on the language in which the question is asked.

Paper Structure

This paper contains 18 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: This figure illustrates the timeline of the Large Language Models (LLMs) selected for our study, categorized by their release date.
  • Figure 2: Comparative consistency metrics for model performance. Figure (a) shows the variability in scaled answers, while Figure (b) shows the judged quality of essay responses.
  • Figure 3: This figure shows the perfect consistency of "Yes" and "No" responses within each model across multiple runs and languages. A perfectly consistent model would always give the same "Yes" or "No" answer for a given question in a given language across all runs.
  • Figure 4: Language agreement with the cross-model consensus for each question. Low scores indicate a strong divergent narrative.
  • Figure 5: Detailed matrix of response consistency, where each cell visualizes the stability of a specific model's "Yes" answers to a specific question across four runs, further subdivided by the language of the prompt. The color of each quadrant indicates the count of "Yes" responses, providing a granular view of both intra-model consistency and cross-lingual bias.
  • ...and 1 more figure
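
The "perfect consistency" notion in the Figure 3 caption (the same "Yes" or "No" answer for a given question in a given language across all runs) can be illustrated with a minimal sketch. This is not the authors' code: the function name `perfect_consistency`, the tuple layout, the model names, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def perfect_consistency(responses):
    """Return, per model, the share of (question, language) cells whose runs all agree.

    `responses` is an iterable of (model, question, language, answer) tuples,
    one tuple per run. This layout is a hypothetical one chosen for the sketch.
    """
    # Collect the distinct answers seen in each (model, question, language) cell.
    cells = defaultdict(set)
    for model, question, language, answer in responses:
        cells[(model, question, language)].add(answer.strip().lower())

    totals = defaultdict(int)
    stable = defaultdict(int)
    for (model, _question, _language), answers in cells.items():
        totals[model] += 1
        # A cell is perfectly consistent when every run gave the same answer.
        if len(answers) == 1:
            stable[model] += 1

    return {model: stable[model] / totals[model] for model in totals}


# Toy data (invented): four runs per cell, two languages, one question.
runs = (
    [("model_a", "q1", "ro", "Yes")] * 4
    + [("model_a", "q1", "hu", a) for a in ("Yes", "No", "Yes", "Yes")]
    + [("model_b", "q1", "ro", "No")] * 4
    + [("model_b", "q1", "hu", "No")] * 4
)

scores = perfect_consistency(runs)
print(scores)
```

In this toy input, `model_a` flips once in the Hungarian cell, so only one of its two cells counts as perfectly consistent, while `model_b` is stable in both. Note that a score like this measures stability only; as the abstract points out, the most consistent models are not always those judged most accurate or neutral.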