Exploring Cultural Variations in Moral Judgments with Large Language Models
Hadi Mohammadi, Ayoub Bagheri
TL;DR
This study probes whether Large Language Models reflect cross-cultural moral judgments by benchmarking a broad set of models against two global survey datasets, WVS and PEW, using log-probability-based moral justifiability scores. By converting survey responses into country-topic scores and comparing them with model-derived scores via Pearson correlations, the authors show that instruction-tuned and larger models align more closely with human norms, while many smaller or non-instruction-tuned models exhibit near-zero or negative alignment. A regional analysis reveals a pronounced WEIRD bias, with Western Europe and North America showing the strongest concordance and regions such as MENA and Sub-Saharan Africa lagging, underscoring representation gaps in training data. The work highlights both progress and persistent gaps, emphasizing region-specific calibration, ensemble approaches, and human-in-the-loop validation for culturally sensitive AI systems in global settings.
Abstract
Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs mirror variations in moral attitudes reported by the World Values Survey (WVS) and the Pew Research Center's Global Attitudes Survey (PEW). We compare smaller monolingual and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct). Using log-probability-based moral justifiability scores, we correlate each model's outputs with survey data covering a broad set of ethical topics. Our results show that many earlier or smaller models produce near-zero or negative correlations with human judgments. In contrast, advanced instruction-tuned models achieve substantially higher positive correlations, suggesting they better reflect real-world moral attitudes. We provide a detailed regional analysis revealing that models align better with Western, Educated, Industrialized, Rich, and Democratic (W.E.I.R.D.) nations than with other regions. While scaling model size and using instruction tuning improve alignment with cross-cultural moral norms, challenges remain for certain topics and regions. We discuss these findings in relation to bias analysis, training data diversity, information retrieval implications, and strategies for improving the cultural sensitivity of LLMs.
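
The evaluation idea summarized above (a log-probability-based justifiability score per country and topic, correlated with survey-derived scores) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a simple log-probability difference, the rescaling of a 1-to-10 survey scale to [-1, 1], and the toy numbers are all assumptions made for illustration only.

```python
import math


def justifiability_score(logp_justifiable: float, logp_unjustifiable: float) -> float:
    """Hypothetical model score: log-probability of a 'justifiable' completion
    minus that of an 'unjustifiable' one. Positive values mean the model
    favors the 'justifiable' reading. (Assumed form, not taken from the paper.)"""
    return logp_justifiable - logp_unjustifiable


def rescale_survey(mean_response: float, low: float = 1.0, high: float = 10.0) -> float:
    """Map a mean survey response on [low, high] linearly to [-1, 1].
    The 1-to-10 default is an assumption about the response scale."""
    return 2.0 * (mean_response - low) / (high - low) - 1.0


def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy placeholder values for three (country, topic) pairs; NOT real data.
    survey_means = [2.0, 5.5, 8.0]
    model_logps = [(-4.0, -1.5), (-2.0, -2.2), (-1.0, -3.5)]

    survey_scores = [rescale_survey(m) for m in survey_means]
    model_scores = [justifiability_score(a, b) for a, b in model_logps]
    print("Pearson r:", pearson(survey_scores, model_scores))
```

In practice the two log-probabilities would come from scoring contrasting prompt completions with a language model, and the correlation would be computed over all country-topic pairs for each model and survey.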
