Multilingual large language models leak human stereotypes across language boundaries

Yang Trista Cao, Anna Sotnikova, Jieyu Zhao, Linda X. Zou, Rachel Rudinger, Hal Daume

TL;DR

The paper defines stereotype leakage as cross-language transfer of stereotypical associations in multilingual LLMs and introduces a measurement framework using ABC model-based human stereotypes and model-derived trait associations. It evaluates English, Russian, Chinese, and Hindi across mBERT, mT5, and GPT-3.5, revealing bidirectional leakage with Hindi being particularly susceptible and GPT-3.5 showing the strongest effects. Quantitative results report significant cross-language leakages with notable examples across language pairs, while qualitative analysis details positive, negative, and non-polar leakages and non-shared-group transfers. The work highlights safety and fairness implications for cross-cultural NLP applications and provides a data-driven methodology for future bias mitigation research.

Abstract

Multilingual large language models have gained prominence for their proficiency in processing and generating text across languages. Like their monolingual counterparts, multilingual models are likely to pick up on stereotypes and other social biases present in their training data. In this paper, we study a phenomenon we term stereotype leakage, which refers to how training a model multilingually may lead to stereotypes expressed in one language showing up in the models' behaviour in another. We propose a measurement framework for stereotype leakage and investigate its effect across English, Russian, Chinese, and Hindi and with GPT-3.5, mT5, and mBERT. Our findings show a noticeable leakage of positive, negative, and non-polar associations across all languages. We find that of these models, GPT-3.5 exhibits the most stereotype leakage, and Hindi is the most susceptible to leakage effects. WARNING: This paper contains model outputs which could be offensive in nature.

Paper Structure (21 sections, 1 equation, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Results of human annotations in English (EN), Russian (RU), Chinese (ZH), and Hindi (HI), based on the ABC model, for the social group "Asian people". Scores are averaged across all annotators within each language.
  • Figure 2: Stereotype leakage for three models: mBERT, mT5, and GPT-3.5, respectively. Each panel illustrates the flow from the human source language (left column) to the target language in a particular model (right column). The numbers are the mixed-effects coefficients (denoted $\alpha$ in the paper's equation). If no flow is shown for a language, no significant leakage was found for it.
  • Figure 3: Selected points of the consent form highlighting study format, confidentiality, and potential risks.
  • Figure 4: Example of the survey.
  • Figure 5: Instructions before crowd workers view the task itself.