Are Rules Meant to be Broken? Understanding Multilingual Moral Reasoning as a Computational Pipeline with UniMoral

Shivani Kumar, David Jurgens

TL;DR

This work introduces UniMoral, a multilingual, holistic dataset and computational pipeline for studying the full moral reasoning process across six languages. By combining psychologically grounded dilemmas with socially sourced Reddit scenarios, and enriching annotations with action choices, ethical principles, contributing factors, consequences, and individual moral-cultural profiles, UniMoral enables four core evaluations on large language models: action prediction, moral typology classification, factor attribution, and consequence generation. The study reveals substantial language-dependent performance gaps, highlights the benefit of contextual cues such as moral values and persona, and shows that real-world Reddit data pose greater challenges than curated psychological scenarios. Overall, UniMoral advances cross-cultural moral reasoning research in NLP and offers a platform for exploring bias, cultural variation, and moral value quantification in AI systems.

Abstract

Moral reasoning is a complex cognitive process shaped by individual experiences and cultural contexts, and it presents unique challenges for computational analysis. While natural language processing (NLP) offers promising tools for studying this phenomenon, current research lacks cohesion, employing discordant datasets and tasks that examine isolated aspects of moral reasoning. We bridge this gap with UniMoral, a unified dataset integrating psychologically grounded and social-media-derived moral dilemmas annotated with labels for action choices, ethical principles, contributing factors, and consequences, alongside annotators' moral and cultural profiles. Recognizing the cultural relativity of moral reasoning, UniMoral spans six languages (Arabic, Chinese, English, Hindi, Russian, and Spanish), capturing diverse socio-cultural contexts. We demonstrate UniMoral's utility through benchmark evaluations of three large language models (LLMs) across four tasks: action prediction, moral typology classification, factor attribution analysis, and consequence generation. Key findings reveal that while implicitly embedded moral contexts enhance the moral reasoning capability of LLMs, there remains a critical need for increasingly specialized approaches to further advance moral reasoning in these models.

Paper Structure

This paper contains 43 sections, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Moral reasoning pipeline: An individual encounters a moral scenario, lists the potential actions they can take, and selects one. The chosen action yields outcomes affecting stakeholders and societal norms. The "Moralsphere" conceptualizes this dynamic interplay between reasoning, action, and societal impact in resolving moral dilemmas.
  • Figure 2: UniMoral at a glance. [Abbreviations -- PDI: Power Distance, IDV: Individualism, MAS: Masculinity, UAI: Uncertainty Avoidance, LTO: Long Term Orientation, IVR: Indulgence vs Restraint]
  • Figure 3: Models perform best in English, Spanish, and Russian while struggling with Arabic, Chinese, and Hindi, as shown by their language-specific performance. The scores are average weighted F1 scores for action prediction (AP), moral typology classification (MTC), and factor attribution analysis (FAA), and BERTScore for consequence generation (CG). The dotted line represents random performance for each task.
  • Figure 4: Contextual cues like moral values and persona help LLMs make better moral decisions. The scores are average weighted F1 scores. The dotted line represents random performance for each task.
  • Figure 5: Models perform better on psychologically grounded scenarios than on Reddit-based dilemmas across all tasks and languages. The scores are average weighted F1 scores for AP, MTC, and FAA, and BERTScore for CG. The dotted line represents random performance for each task.
  • ...and 5 more figures
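
Several of the captions above report weighted F1 scores. As a reading aid, here is a minimal sketch of how a support-weighted F1 is commonly computed for a classification task. It is not the authors' evaluation code; the function name `weighted_f1` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")

    support = Counter(y_true)  # number of gold examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight by class frequency in the gold labels
    return score


# Toy example with two made-up action labels.
gold = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
print(round(weighted_f1(gold, pred), 4))
```

Weighting by support means frequent classes contribute more to the final score, which is one reason a random-guess baseline (the dotted line in the figures) can differ from task to task.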