Whose Morality Do They Speak? Unraveling Cultural Bias in Multilingual Language Models

Meltem Aksoy

TL;DR

This paper investigates whether multilingual LLMs encode universal moral norms or culture-specific values by applying the updated Moral Foundations Questionnaire (MFQ-2) across eight languages and four models. Using a structured prompting regime and statistical analyses, it finds substantial cultural and linguistic variability in moral judgments: English norms are not uniformly imposed, and responses differ significantly between WEIRD (Western, Educated, Industrialized, Rich, Democratic) and non-WEIRD contexts. Larger, more diverse models (e.g., GPT-4o-mini, GPT-3.5-Turbo) align more closely with human judgments, yet cross-language biases and deviations persist, especially in underrepresented languages. The study highlights the need for culturally inclusive data, prompting strategies, and evaluation frameworks to improve fairness, trust, and cultural fidelity in multilingual AI systems.

Abstract

Large language models (LLMs) have become integral tools in diverse domains, yet their moral reasoning capabilities across cultural and linguistic contexts remain underexplored. This study investigates whether multilingual LLMs, such as GPT-3.5-Turbo, GPT-4o-mini, Llama 3.1, and MistralNeMo, reflect culturally specific moral values or impose dominant moral norms, particularly those rooted in English. Using the updated Moral Foundations Questionnaire (MFQ-2) in eight languages (Arabic, Farsi, English, Spanish, Japanese, Chinese, French, and Russian), the study analyzes the models' adherence to six core moral foundations: care, equality, proportionality, loyalty, authority, and purity. The results reveal significant cultural and linguistic variability, challenging the assumption of universal moral consistency in LLMs. Although some models demonstrate adaptability to diverse contexts, others exhibit biases influenced by the composition of the training data. These findings underscore the need for culturally inclusive model development to improve fairness and trust in multilingual AI systems.

Paper Structure

This paper contains 21 sections, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Mean differences in moral foundations across languages.
  • Figure 2: Mean differences in moral foundations across models.
  • Figure 3: Comparison of LLMs and human moral foundation scores across all languages.
  • Figure 4: Language-specific comparison of LLMs and human moral foundation scores.
  • Figure 5: Care scores across languages, models, and overall distribution.
  • ...and 5 more figures