LLMs as mirrors of societal moral standards: reflection of cultural divergence and agreement across ethical topics

Mijntje Meijer, Hadi Mohammadi, Ayoub Bagheri

TL;DR

This study probes whether large language models (LLMs) faithfully reflect cross-cultural moral judgments by contrasting model-generated moral scores with World Values Survey (WVS) and Pew data. Using three methods—variance comparison, cluster alignment via ARI/AMI, and direct comparative prompting—across monolingual (GPT-2, OPT) and multilingual (BLOOM, BLOOMZ, Qwen) models, it finds overall variable and low alignment with empirical data, with models tending to overstate global moral agreement. The results suggest that neither increased model size nor multilingual training reliably enhances cultural sensitivity, underscoring persistent biases and the need for improved data, prompting strategies, and evaluation frameworks for ethical AI in global contexts. These findings have implications for deploying LLMs in diverse cultural settings and highlight the importance of bias mitigation and fair representation of cultural moral diversity.

Abstract

Large language models (LLMs) have become increasingly pivotal in various domains due to recent advancements in their performance capabilities. However, concerns persist regarding biases in LLMs, including gender, racial, and cultural biases derived from their training data. These biases raise critical questions about the ethical deployment and societal impact of LLMs. Acknowledging these concerns, this study investigates whether LLMs accurately reflect cross-cultural variations and similarities in moral perspectives. In assessing whether the chosen LLMs capture patterns of divergence and agreement on moral topics across cultures, three main methods are employed: (1) comparison of model-generated and survey-based moral score variances, (2) cluster alignment analysis to evaluate the correspondence between country clusters derived from model-generated moral scores and those derived from survey data, and (3) probing LLMs with direct comparative prompts. All three methods involve the use of systematic prompts and token pairs designed to assess how well LLMs understand and reflect cultural variations in moral attitudes. The findings of this study indicate overall variable and low performance in reflecting cross-cultural differences and similarities in moral values across the models tested, highlighting the necessity for improving models' accuracy in capturing these nuances effectively. The insights gained from this study aim to inform discussions on the ethical development and deployment of LLMs in global contexts, emphasizing the importance of mitigating biases and promoting fair representation across diverse cultural perspectives.
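
As an illustration of the cluster alignment idea in method (2), the following minimal sketch computes the adjusted Rand index (ARI) between two clusterings of the same countries. It is not the authors' implementation: the function name, the cluster labels, and the six-country toy data are hypothetical and chosen only to show how the score behaves.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    # Count item pairs that share a cluster: jointly, in A only, in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both clusterings are trivial
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical cluster labels for six countries (toy data, not from the paper).
survey_clusters = [0, 0, 1, 1, 2, 2]     # clusters derived from survey scores
model_same = [1, 1, 0, 0, 2, 2]          # same grouping, different label names
model_collapsed = [0, 0, 0, 0, 0, 0]     # model puts every country together

print(adjusted_rand_index(survey_clusters, model_same))
print(adjusted_rand_index(survey_clusters, model_collapsed))
```

In this toy setup, a model whose clusters match the survey grouping scores 1 regardless of how the clusters are labeled, while a model that places every country in a single cluster scores 0, which is the level expected by chance.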

Paper Structure

This paper contains 35 sections, 6 figures, 40 tables.

Figures (6)

  • Figure 1: Distribution of normalized answer values for WVS wave 7
  • Figure 2: Spread of responses across the moral topics and countries for WVS wave 7
  • Figure 3: Distribution of normalized answer values for PEW 2013
  • Figure 4: Spread of responses across the moral topics and countries for PEW 2013
  • Figure 5: Comparison between the degrees of cultural diversities and shared tendencies in the empirical moral ratings and language-model inferred moral scores for WVS
  • ...and 1 more figures