Bilingual Bias in Large Language Models: A Taiwan Sovereignty Benchmark Study
Ju-Chun Ko
TL;DR
This study investigates language-dependent political bias in large language models using the Taiwan Sovereignty Benchmark Pro, a bilingual (Traditional Chinese and English) evaluation framework with 10 prompts. It evaluates 17 LLMs via a cloud API and introduces two metrics, the Language Bias Score (LBS) and Quality-Adjusted Consistency (QAC), to quantify cross-language bias and reliability. The results show pervasive bias: Chinese-origin models often censor responses or embed CCP narratives, while GPT-4o Mini alone achieves perfect scores in both languages. The findings suggest that model origin and training data composition shape cross-language behavior. The study proposes four hypotheses (training data contamination, ISO standards, API censorship, and embedded censorship) and offers practical guidance for safer bilingual evaluation and deployment. The work emphasizes open-sourcing the benchmark and metrics to foster reproducibility and broader participation in evaluating AI safety and informational integrity across languages.
Abstract
Large Language Models (LLMs) are increasingly deployed in multilingual contexts, yet their consistency across languages on politically sensitive topics remains understudied. This paper presents a systematic bilingual benchmark study examining how 17 LLMs respond to questions concerning the sovereignty of the Republic of China (Taiwan) when queried in Chinese versus English. We find significant language bias: the same model produces substantively different political stances depending on the query language. Of the 17 models tested, 15 exhibit measurable language bias. Chinese-origin models show particularly severe issues, including complete refusal to answer and explicit propagation of Chinese Communist Party (CCP) narratives. Notably, only GPT-4o Mini achieves a perfect 10/10 score in both languages. We propose novel metrics for quantifying language bias and consistency, including the Language Bias Score (LBS) and Quality-Adjusted Consistency (QAC). Our benchmark and evaluation framework are open-sourced to enable reproducibility and community extension.
