Do Moral Judgment and Reasoning Capability of LLMs Change with Language? A Study using the Multilingual Defining Issues Test
Aditi Khandelwal, Utkarsh Agarwal, Kumar Tanmay, Monojit Choudhury
TL;DR
This work investigates whether LLMs' moral judgment and moral reasoning vary with language by extending the Defining Issues Test (DIT) to five languages beyond English and evaluating GPT-4, ChatGPT, and Llama2-Chat-70B. The study translates the dilemmas and prompts, extracts each model's judgment and its top-ranked moral considerations, and computes CMD-based scores such as the post-conventional $P_{score}$. The results show pronounced language effects: Hindi and Swahili yield substantially lower $P_{score}$ and more variable judgments, English and Spanish generally perform best, and Chinese and Russian show mixed results depending on the model. Among the models, GPT-4 consistently achieves higher cross-language consistency and stronger post-conventional reasoning, while ChatGPT and Llama2-Chat-70B display greater language-dependent variation. The work offers a multilingual DIT dataset and highlights considerations for the ethical evaluation of multilingual LLMs, including data quality, cultural context, and the need for diverse dilemma corpora to avoid biased conclusions.
Abstract
This paper explores the moral judgment and moral reasoning abilities exhibited by Large Language Models (LLMs) across languages through the Defining Issues Test (DIT). It is well known that moral judgment depends on the language in which the question is asked. We extend prior DIT-based work beyond English to five new languages (Chinese, Hindi, Russian, Spanish, and Swahili) and probe three LLMs (ChatGPT, GPT-4, and Llama2-Chat-70B) that show substantial multilingual text processing and generation abilities. Our study shows that the moral reasoning ability of all models, as indicated by the post-conventional score, is substantially inferior for Hindi and Swahili compared to Spanish, Russian, Chinese, and English, while there is no clear trend among the latter four languages. The moral judgments also vary considerably by language.
