Multilingual large language models leak human stereotypes across language boundaries
Yang Trista Cao, Anna Sotnikova, Jieyu Zhao, Linda X. Zou, Rachel Rudinger, Hal Daumé III
TL;DR
The paper defines stereotype leakage as the cross-language transfer of stereotypical associations in multilingual LLMs and introduces a measurement framework that combines human stereotypes based on the ABC model with model-derived trait associations. It evaluates English, Russian, Chinese, and Hindi across mBERT, mT5, and GPT-3.5 and finds bidirectional leakage, with Hindi the most susceptible language and GPT-3.5 showing the strongest effects. The quantitative results report significant leakage across language pairs, and the qualitative analysis details positive, negative, and non-polar leakage as well as transfers involving non-shared groups. The work highlights safety and fairness implications for cross-cultural NLP applications and provides a data-driven methodology for future bias mitigation research.
Abstract
Multilingual large language models have gained prominence for their proficiency in processing and generating text across languages. Like their monolingual counterparts, multilingual models are likely to pick up on stereotypes and other social biases present in their training data. In this paper, we study a phenomenon we term stereotype leakage, which refers to how training a model multilingually may lead to stereotypes expressed in one language showing up in the models' behaviour in another. We propose a measurement framework for stereotype leakage and investigate its effect across English, Russian, Chinese, and Hindi and with GPT-3.5, mT5, and mBERT. Our findings show a noticeable leakage of positive, negative, and non-polar associations across all languages. We find that of these models, GPT-3.5 exhibits the most stereotype leakage, and Hindi is the most susceptible to leakage effects. WARNING: This paper contains model outputs which could be offensive in nature.
