Probing the Moral Development of Large Language Models through Defining Issues Test

Kumar Tanmay; Aditi Khandelwal; Utkarsh Agarwal; Monojit Choudhury

TL;DR

The paper evaluates moral development in seven LLMs by applying Rest's Defining Issues Test within Kohlberg's Cognitive Moral Development framework, deriving P-scores and stage-based scores from responses to nine dilemmas (five from DIT-1 plus four novel scenarios). It demonstrates that GPT-4 exhibits post-conventional moral reasoning comparable to graduate students, while GPT-3 remains near random, with others showing conventional-level reasoning and notable dilemma-specific variability. The authors discuss methodological limitations of applying DIT to AI, potential sources of emergent moral behavior, and the need for further, culturally aware ethical evaluation of LLMs. Overall, the work provides a structured, dilemma-based approach to quantifying AI moral reasoning and highlights substantial gaps and ethical considerations for deployment.

Abstract

In this study, we measure the moral reasoning ability of LLMs using the Defining Issues Test (DIT), a psychometric instrument developed for measuring the moral development stage of a person according to Kohlberg's Cognitive Moral Development Model. DIT presents moral dilemmas, each followed by a set of ethical considerations that the respondent first rates for importance in resolving the dilemma and then rank-orders by importance. A moral development stage score of the respondent is then computed from these relevance ratings and rankings. Our study shows that early LLMs such as GPT-3 exhibit a moral reasoning ability no better than that of a random baseline, while ChatGPT, Llama2-Chat, PaLM-2 and GPT-4 show significantly better performance on this task, comparable to adult humans. GPT-4, in fact, has the highest post-conventional moral reasoning score, equivalent to that of typical graduate school students. However, we also observe that the models do not perform consistently across all dilemmas, pointing to important gaps in their understanding and reasoning abilities.
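The scoring step described in the abstract can be illustrated with a minimal sketch. It assumes the conventional DIT weighting, in which the four statements ranked most important earn 4, 3, 2 and 1 points, and the post-conventional score is the share of those points that goes to stage 5 and 6 statements. The function name, the integer stage labels and the example statement-to-stage mapping are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Assumed rank weights: the top four ranked statements earn 4, 3, 2, 1 points.
RANK_WEIGHTS = (4, 3, 2, 1)
# Assumed simplification: stages are plain integers, 5 and 6 are post-conventional.
POST_CONVENTIONAL = {5, 6}


def p_score(top_four: List[int], stage_of: Dict[int, int]) -> float:
    """Return a 0-100 post-conventional score for a single dilemma.

    top_four: statement ids in rank order, most important first.
    stage_of: maps each statement id to the moral stage it represents.
    """
    if len(top_four) != len(RANK_WEIGHTS):
        raise ValueError("expected exactly four ranked statements")
    if len(set(top_four)) != len(top_four):
        raise ValueError("ranked statements must be distinct")
    # Sum the weights of the ranked statements that are post-conventional.
    raw = sum(
        weight
        for weight, statement in zip(RANK_WEIGHTS, top_four)
        if stage_of[statement] in POST_CONVENTIONAL
    )
    # Normalize by the maximum attainable points (4 + 3 + 2 + 1 = 10).
    return 100.0 * raw / sum(RANK_WEIGHTS)


# Hypothetical mapping of six statements to stages, for illustration only.
example_stages = {1: 3, 2: 4, 3: 5, 4: 2, 5: 6, 6: 3}
# Ranks 1 and 3 are post-conventional here: 4 + 2 = 6 of 10 points.
example = p_score([3, 2, 5, 1], example_stages)
```

Averaging this per-dilemma value over all dilemmas would give one overall score per respondent; whether the paper aggregates in exactly this way is not stated in the abstract.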

Paper Structure (13 sections, 1 equation, 8 figures)


Introduction
Background and Related Work
Morality and Moral Development
Rest's Defining Issues Test
Recent Theories in Moral Philosophy
Current Approaches to Ethics of LLMs
Data and Method
Dataset
Experimental Setup
Metrics
Results and Observations
Discussion and Conclusion
Dilemmas

Figures (8)

Figure 1: Prompt structure illustrated for Monica's Dilemma.
Figure 2: Dilemma-wise $p_{score}$ comparison across LLMs. The dotted line shows the random baseline $p_{score}$ for each dilemma.
Figure 3: Model-wise scores and their dilemma-wise resolutions. PaLM-2 results are from 8 dilemmas (see Results and Observations). In panel (b), the RGB components of each color depict the fraction of runs with the corresponding resolution (Green: O1, Should do; Blue: O2, Can't decide; Red: O3, Shouldn't do).
Figure 4: Story and 12 statements for Rajesh's Dilemma
Figure 5: Story and 12 statements for Monica's Dilemma
...and 3 more figures
