A Cross-Lingual Analysis of Bias in Large Language Models Using Romanian History
Matei-Iulian Cocu, Răzvan-Cosmin Cristia, Adrian Marius Dumitran
TL;DR
The paper investigates cross-lingual biases in Large Language Models when answering contested Romanian historical questions. It employs a three-stage prompting protocol across Romanian, English, Hungarian, and Russian, and analyzes 13 models on 14 statements to examine stance stability, numerical ratings, and justificatory essays. The study reveals significant inconsistency across languages and formats, with language-specific training data shaping narrative biases and some models showing more stable behavior than others. It argues for evaluation metrics that prioritize consistency and bias detection, and proposes the LLM-as-a-judge paradigm as a scalable way to curate nuanced datasets. The findings underscore the need for cautious deployment of LLMs in humanities contexts and for multilingual, bias-aware evaluation frameworks.
Abstract
In this case study, we select a set of controversial Romanian historical questions and ask multiple Large Language Models to answer them across languages and contexts in order to assess their biases. Although the study was carried out mainly for educational purposes, it is also motivated by the recognition that history is often presented through altered perspectives shaped by the culture and ideals of a state, and that large language models can reproduce these perspectives. Because such models are trained on data sets that may themselves contain ambiguities, their lack of neutrality can in turn be passed on to users. The research was carried out in three stages to test the idea that the expected response format can influence, to some extent, the response itself: after giving an affirmative answer to a question, an LLM may change its position when asked the same question again but told to respond with a numerical value on a scale. Results show that binary response stability is relatively high but far from perfect and varies by language. Models often flip stance across languages or between formats; numeric ratings frequently diverge from the initial binary choice, and the most consistent models are not always those judged most accurate or neutral. Our research highlights how prone models are to such inconsistencies, depending on the language in which a question is asked.
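
As an illustration of the two kinds of inconsistency described above (stance flips across languages, and numeric ratings that contradict the binary answer), the following is a minimal sketch of how they could be quantified. It is not the scoring procedure used in the study: the record layout, the function names, the 1 to 10 rating scale and its midpoint, and the sample data are all assumptions made for the example.

```python
# Hypothetical record layout: (model, statement_id, language, binary_answer, rating).
# The 1-10 scale is an assumption; the abstract does not specify the scale used.
SCALE_MIN, SCALE_MAX = 1, 10


def rating_to_stance(rating):
    """Map a numeric rating to "yes"/"no"; return None for an exact midpoint."""
    midpoint = (SCALE_MIN + SCALE_MAX) / 2
    if rating == midpoint:
        return None
    return "yes" if rating > midpoint else "no"


def cross_language_agreement(records):
    """Fraction of (model, statement) pairs with the same binary answer in every language."""
    answers_by_pair = {}
    for model, statement, _lang, binary, _rating in records:
        answers_by_pair.setdefault((model, statement), set()).add(binary)
    if not answers_by_pair:
        return 0.0
    stable = sum(1 for answers in answers_by_pair.values() if len(answers) == 1)
    return stable / len(answers_by_pair)


def format_flip_rate(records):
    """Fraction of records whose rating implies a different stance than the binary answer."""
    flips = 0
    comparable = 0
    for _model, _statement, _lang, binary, rating in records:
        implied = rating_to_stance(rating)
        if implied is None:
            continue  # a midpoint rating carries no stance to compare
        comparable += 1
        if implied != binary:
            flips += 1
    return flips / comparable if comparable else 0.0


# Invented sample data, for illustration only.
records = [
    ("model_a", 1, "ro", "yes", 8),
    ("model_a", 1, "en", "yes", 4),  # rating contradicts the binary "yes"
    ("model_a", 1, "hu", "no", 3),   # stance differs from the ro/en answers
    ("model_a", 2, "ro", "no", 2),
    ("model_a", 2, "en", "no", 3),
]

agreement = cross_language_agreement(records)
flip_rate = format_flip_rate(records)
print(f"cross-language agreement: {agreement:.2f}")
print(f"binary vs. numeric flip rate: {flip_rate:.2f}")
```

Keeping the two measures separate mirrors the distinction drawn in the abstract: a model can answer identically in every language and still contradict itself when the response format changes.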
