Do Moral Judgment and Reasoning Capability of LLMs Change with Language? A Study using the Multilingual Defining Issues Test

Aditi Khandelwal; Utkarsh Agarwal; Kumar Tanmay; Monojit Choudhury

TL;DR

This work investigates whether LLMs' moral judgment and moral reasoning vary with language by extending the Defining Issues Test (DIT) to five languages and evaluating GPT-4, ChatGPT, and Llama2-Chat-70B. By translating dilemmas and prompts, extracting judgments and top moral considerations, and computing CMD-based scores such as the post-conventional $P_{score}$, the study reveals pronounced language effects: Hindi and Swahili yield substantially lower $P_{score}$ and more variable judgments, while English and Spanish generally perform best and Chinese/Russian show mixed results depending on the model. Among models, GPT-4 consistently achieves higher cross-language consistency and stronger post-conventional reasoning, with ChatGPT and Llama2-Chat displaying greater language-dependent variation. The work offers a multilingual DIT dataset and highlights important considerations for ethical evaluation of multilingual LLMs, including data quality, cultural context, and the need for diverse dilemma corpora to avoid biased conclusions.

Abstract

This paper explores the moral judgment and moral reasoning abilities exhibited by Large Language Models (LLMs) across languages through the Defining Issues Test. It is well known that moral judgment depends on the language in which the question is asked. We extend prior work beyond English to 5 new languages (Chinese, Hindi, Russian, Spanish and Swahili), and probe three LLMs (ChatGPT, GPT-4 and Llama2Chat-70B) that show substantial multilingual text processing and generation abilities. Our study shows that the moral reasoning ability of all models, as indicated by the post-conventional score, is substantially inferior for Hindi and Swahili compared to Spanish, Russian, Chinese and English, while there is no clear trend in performance among the latter four languages. The moral judgments also vary considerably by language.

Paper Structure (19 sections, 2 equations, 5 figures, 1 table)

Introduction
Background: Moral Psychology and Ethics of NLP
Defining Issues Test
Moral Judgment vs. Moral Reasoning
Language and Morality
Current Approaches to Ethics of LLMs
Performance of LLMs across Languages
Experiments
Dataset and Prompt
Experimental Setup
Method
Metrics
Results and Observation
Moral Judgment by the LLMs
Moral Reasoning by LLMs
...and 4 more sections

Figures (5)

Figure 1: Dilemma-specific resolution heatmaps across languages for ChatGPT, Llama2Chat-70B, and GPT-4. O1 is indicated in green, O2 in blue, and O3 in red. The RGB components of each cell encode the number of instances in which the model's answer corresponded to O1, O2, or O3 for that language and dilemma. White areas represent cases where no observation yielded an extractable resolution to the dilemma.
Figure 2: Overview of stage-wise scores for ChatGPT, Llama2Chat, and GPT-4, averaged across all moral dilemmas. The cumulative score of the first three tiers (red, orange, and deep yellow) is the $P_{score}$, or post-conventional morality score. The 4th tier (light yellow) is the Maintaining Norms schema score, and the 5th and 6th tiers (green and blue) combined give the Personal Interests schema score.
Figure 3: Comparing dilemma-specific and overall P-scores among ChatGPT, Llama2Chat, and GPT-4, versus the random baselines, across five languages for ChatGPT and Llama2Chat (excluding Hindi) and six languages for GPT-4.
Figure 4: An illustration of contemporary language models with the world cultural map (rao2023ethical).
Figure 5: Prompt structure illustrated for Monica's Dilemma in Hindi.
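
The tier grouping described in the Figure 2 caption can be sketched as a small aggregation step. This is only an illustration under stated assumptions: the function name, the dictionary keys, and the input numbers are hypothetical, the six tier scores are assumed to arrive in the order the caption lists them, and how each tier score is computed from model responses is not shown here.

```python
from typing import Dict, Sequence


def schema_scores(tier_scores: Sequence[float]) -> Dict[str, float]:
    """Group six stage-wise tier scores into the three schema scores.

    Assumes the tiers are ordered as in the Figure 2 caption:
    tiers 1-3 -> post-conventional (P-score), tier 4 -> Maintaining Norms,
    tiers 5-6 -> Personal Interests.
    """
    if len(tier_scores) != 6:
        raise ValueError("expected exactly six tier scores")
    return {
        "post_conventional": sum(tier_scores[:3]),
        "maintaining_norms": tier_scores[3],
        "personal_interests": sum(tier_scores[4:]),
    }


# Made-up tier scores for illustration only; these are not results from the paper.
example = schema_scores([10.0, 15.0, 20.0, 25.0, 20.0, 10.0])
print(example)
```

With these made-up inputs, the first three tiers add up to the post-conventional score, the fourth is passed through unchanged, and the last two add up to the Personal Interests score.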
