Bilingual Bias in Large Language Models: A Taiwan Sovereignty Benchmark Study

Ju-Chun Ko

TL;DR

This study investigates language-dependent political bias in large language models by introducing the Taiwan Sovereignty Benchmark Pro, a bilingual (Traditional Chinese and English) evaluation framework with 10 prompts. It evaluates 17 LLMs via a cloud API and introduces two metrics, the Language Bias Score ($LBS$) and Quality-Adjusted Consistency ($QAC$), to quantify cross-language bias and reliability. Results show pervasive bias: Chinese-origin models often censor responses or embed CCP narratives, while GPT-4o Mini alone achieves perfect scores in both languages. The findings suggest that model origin and training data composition shape cross-language behavior. The authors propose four hypotheses (training data contamination, ISO standards, API censorship, embedded censorship) and offer practical guidance for safer bilingual evaluation and deployment. The work emphasizes open-sourcing the benchmark and metrics to foster reproducibility and broader participation in evaluating AI safety and informational integrity across languages.

Abstract

Large Language Models (LLMs) are increasingly deployed in multilingual contexts, yet their consistency across languages on politically sensitive topics remains understudied. This paper presents a systematic bilingual benchmark study examining how 17 LLMs respond to questions concerning the sovereignty of the Republic of China (Taiwan) when queried in Chinese versus English. We find significant language bias, the phenomenon in which the same model takes substantively different political stances depending on the query language. Our findings reveal that 15 out of 17 tested models exhibit measurable language bias, with Chinese-origin models showing particularly severe issues, including complete refusal to answer or explicit propagation of Chinese Communist Party (CCP) narratives. Notably, only GPT-4o Mini achieves a perfect 10/10 score in both languages. We propose novel metrics for quantifying language bias and consistency, including the Language Bias Score (LBS) and Quality-Adjusted Consistency (QAC). Our benchmark and evaluation framework are open-sourced to enable reproducibility and community extension.

Paper Structure

This paper contains 39 sections, 5 equations, and 4 tables.

Introduction
Related Work
Political Bias in Large Language Models
Multilingual Inconsistency in Language Models
Chinese AI Censorship and Content Moderation
Taiwan-Specific AI Research and Benchmarks
Methodology
Benchmark Design and Prompt Construction
Red Flag Detection and Scoring Criteria
Scoring Methodology
Language Bias Score (LBS)
Consistency and Quality-Adjusted Consistency
Statistical Analysis
Models Evaluated
Results
...and 24 more sections
