MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Multi-hop Hate Speech Explanation

Jackson Trager, Francielle Vargas, Diego Alves, Matteo Guida, Mikel K. Ngueajio, Ameeta Agrawal, Yalda Daryani, Farzan Karimi-Malekabadi, Flor Miriam Plaza-del-Arco

TL;DR

MFTCXplain addresses the need for culturally grounded evaluation of moral reasoning in language models by introducing a multilingual benchmark that links hate speech to Moral Foundations Theory via multi-hop explanations. The corpus comprises 3,000 tweets across English, Italian, Persian, and Portuguese, annotated with hate labels, ten moral categories, and text-span rationales, enabling interpretable model analysis. Prompting experiments with large language models show strong hate-speech detection but markedly weaker moral-sentiment prediction and rationale alignment, particularly in underrepresented languages, underscoring current limitations in cross-cultural moral reasoning. The work provides a novel annotation schema, a cross-linguistic analysis of moral framing in hate speech, and publicly available data and code to drive future improvements in multilingual, explainable moral NLP.

Abstract

Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via multi-hop hate speech explanation using Moral Foundations Theory. MFTCXplain comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Our results show a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited, mainly in underrepresented languages. Our findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning.
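
To make the three annotation layers concrete, the following is a minimal sketch of how one annotated tweet could be represented and how a predicted rationale could be scored against the human one. Everything in it is an assumption made for illustration: the class name `AnnotatedTweet`, its field names, the placeholder text, and the set-based token-overlap F1 are not taken from the paper, and the released dataset and the authors' evaluation may use a different schema and a different alignment metric.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedTweet:
    # Hypothetical schema: field names are illustrative, not the dataset's own.
    text: str
    language: str                 # e.g. "en", "it", "fa", "pt"
    is_hate: bool                 # binary hate speech label
    moral_labels: list[str] = field(default_factory=list)  # moral categories
    rationales: list[str] = field(default_factory=list)    # spans copied from `text`


def token_f1(predicted_spans: list[str], gold_spans: list[str]) -> float:
    """Set-based token-overlap F1 between predicted and gold rationale spans.

    This is one simple way to quantify rationale alignment; it is an
    assumed stand-in, not necessarily the metric used in the paper.
    """
    pred = {tok for span in predicted_spans for tok in span.lower().split()}
    gold = {tok for span in gold_spans for tok in span.lower().split()}
    if not pred or not gold:
        # Both empty counts as full agreement; exactly one empty counts as none.
        return 1.0 if pred == gold else 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Placeholder record with neutral text; the label value is illustrative only.
example = AnnotatedTweet(
    text="placeholder tweet text with a flagged phrase",
    language="en",
    is_hate=True,
    moral_labels=["harm"],
    rationales=["a flagged phrase"],
)

# A model that returns a partly overlapping span gets partial credit.
score = token_f1(["flagged phrase here"], example.rationales)
print(round(score, 3))  # 0.667
```

Scoring on token sets ignores word order and repeated tokens, which keeps the sketch short; a stricter comparison would work on character offsets of the spans.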

Paper Structure

This paper contains 46 sections, 4 equations, 9 figures, 18 tables.

Figures (9)

  • Figure 1: Multi-hop hate speech explanation for moral reasoning evaluation.
  • Figure 2: Kappa for MFT categories between human annotators of the Portuguese corpus.
  • Figure 3: Performance of GPT-4o across three tasks: Hate Speech, Moral Violations, and Rationale Extraction.
  • Figure 4: Performance of LLaMA-70B across three tasks: Hate Speech, Moral Violations, and Rationale Extraction.
  • Figure 5: Hierarchical clustering of moral categories across languages based on lemmas.
  • ...and 4 more figures