
Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese

Hanjia Lyu, Jiebo Luo, Jian Kang, Allison Koenecke

TL;DR

This work introduces SC-TC-Bench to systematically audit cross-script biases in Large Language Models (LLMs) between Simplified and Traditional Chinese. Using two realistic tasks (regional term choice and regional name choice) and 11 diverse LLMs, the study finds biases that depend on both the task and the prompting language: most LLMs produce the correct regional term more often when prompted in Simplified Chinese, yet they favor Taiwanese names in the name choice task. Likely drivers include training-data imbalance, tokenization, and written character preferences. The authors release an open benchmark dataset, run extensive analyses (including population and online-popularity controls, gender considerations, and script-tokenization experiments), and show that the biases persist despite these controls, which highlights risks for education and employment applications. They advocate broader data coverage, transparency in model training, and robust auditing to mitigate representational harms across Chinese language variants. Overall, the work offers a reproducible framework for evaluating cross-script LLM behavior and underscores the need for equity-focused advances in multilingual NLP systems.

Abstract

While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it remains unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models, spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (https://github.com/brucelyu17/SC-TC-Bench).
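
To make the regional term choice task concrete, the sketch below labels a single response as correct, misaligned (the regional terms are swapped), or other, using the pineapple example ("bo luo" versus "feng li") from the paper's figure captions. It is an illustration under stated assumptions, not the benchmark's scoring code: the item record, the function names, the substring matching, and the rule that a response containing both terms counts as "other" are all hypothetical.

```python
from collections import Counter

# Hypothetical item record (not the benchmark's data format): each regional
# variant lists accepted surface forms in both scripts, because a model may
# answer in either script regardless of the prompt.
PINEAPPLE = {
    "mainland": {"菠萝", "菠蘿"},  # "bo luo"
    "taiwan": {"凤梨", "鳳梨"},    # "feng li"
}


def classify_response(response: str, prompt_script: str, item: dict) -> str:
    """Label a response as 'correct', 'misaligned', or 'other'.

    prompt_script is 'simplified' or 'traditional'. Assumption of this sketch:
    a Simplified prompt expects the Mainland term and a Traditional prompt
    expects the Taiwanese term.
    """
    expected = "mainland" if prompt_script == "simplified" else "taiwan"
    swapped = "taiwan" if expected == "mainland" else "mainland"
    has_expected = any(term in response for term in item[expected])
    has_swapped = any(term in response for term in item[swapped])
    if has_expected and not has_swapped:
        return "correct"
    if has_swapped and not has_expected:
        return "misaligned"
    # Neither term, or both terms: treated as 'other' in this sketch.
    return "other"


def tally(labels):
    """Turn a list of labels into per-label proportions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

Calling `classify_response` once per sampled response and passing the labels to `tally` gives per-model proportions of the kind that could be compared across the two prompting languages.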

Paper Structure

This paper contains 60 sections, 11 figures, 37 tables.

Figures (11)

  • Figure 1: Examples of a prompt question (asked in Simplified Chinese in the left panel, and in Traditional Chinese in the right panel) and the corresponding response for each of three LLMs: GPT-4o, Qwen-1.5 and Taiwan-LLM (LLMs that are English, Simplified Chinese, and Traditional Chinese-oriented, respectively). LLMs do not consistently use culture-specific terms when prompted in the corresponding language variant; for example, Qwen-1.5 answers correctly when prompted in Simplified Chinese, but incorrectly when prompted in Traditional Chinese. English translations of the prompts and responses are written in blue and the script type—whether Simplified or Traditional Chinese—is indicated in bold.
  • Figure 2: All LLMs (except for Breeze) are significantly more likely to generate correct responses when prompted in Simplified Chinese compared to Traditional Chinese ($p<.05$, comparing the two blue shaded bars within each LLM labeled "S" and "T", which refer to the LLM when prompted in Simplified Chinese or Traditional Chinese, respectively). In contrast, LLMs are more likely to generate misaligned responses when prompted in Traditional Chinese ($p<.05$, comparing the yellow shaded bars within each model across S and T); for example, a Traditional Chinese prompt asks for the name of a spiky yellow tropical fruit, and the LLM returns the Simplified Chinese term for pineapple ("bo luo") instead of the expected Traditional Chinese term for pineapple ("feng li").
  • Figure 3: Most LLMs, whether English, Simplified Chinese, or Traditional Chinese-oriented, tend to select a valid Taiwanese name more often than a valid Mainland Chinese name for the regional name choice task (as indicated by the majority of points falling below the 50% dotted horizontal line for Mainland Chinese Name Rate). Furthermore, no LLMs display consistently low rates of valid responses; rather, most LLMs will respond to our name selection prompt with valid candidate names, irrespective of the ethical concerns of choosing candidates by name alone. Within each LLM, rates of valid responses often change depending on prompting language (i.e., each point may shift left or right among the three figure panels).
  • Figure 4: The selection bias favoring Taiwanese names is inverted, with the majority of LLMs favoring Mainland Chinese names, when controlling for the name (so that the only source of variation is the name script, written in Simplified or Traditional Chinese). Arrows indicate the relative movement of data points compared to their positions in Figure 3. Red solid arrows represent an increase in the selection rate of Mainland Chinese names, while blue dashed arrows indicate a decrease.
  • Figure 5: We replicate the regional term choice experiment, with the only modification being the removal of items whose definitions are sourced from GPT-4. The observed pattern remains consistent. Misaligned responses are the ones where the LLM swaps the regional terms. S and T denote the Simplified and Traditional Chinese prompting languages, respectively.
  • ...and 6 more figures
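
The two quantities plotted for the regional name choice task, the rate of valid responses and the Mainland Chinese Name Rate, can be sketched as follows. This is a minimal illustration under assumptions: the function name, the exact-match rule, the placeholder candidate labels, and the choice to compute the Mainland rate only over valid responses are hypothetical and may differ from the paper's definitions.

```python
def name_choice_rates(responses, mainland_names, taiwanese_names):
    """Return (valid_rate, mainland_rate) for a list of model responses.

    A response is 'valid' here if it exactly matches one of the candidate
    names. Assumption of this sketch: mainland_rate is the share of valid
    responses that picked a Mainland Chinese name, so a value below 0.5
    means Taiwanese names were picked more often.
    """
    mainland = sum(r in mainland_names for r in responses)
    taiwanese = sum(r in taiwanese_names for r in responses)
    valid = mainland + taiwanese
    valid_rate = valid / len(responses) if responses else 0.0
    # Undefined when no response is valid, so return None instead of guessing.
    mainland_rate = mainland / valid if valid else None
    return valid_rate, mainland_rate


# Placeholder labels stand in for real candidate names.
mainland_names = {"M1", "M2"}
taiwanese_names = {"T1", "T2"}
responses = ["M1", "T1", "T2", "I cannot choose by name alone."]
valid_rate, mainland_rate = name_choice_rates(responses, mainland_names, taiwanese_names)
```

Returning `None` when no response is valid keeps a model that always refuses from being plotted as if it favored one region.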