The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making

Basile Garcia, Crystal Qian, Stefano Palminteri

TL;DR

This study probes how human and GPT-3.5-derived judgments align in moral decision-making, examining both the detectability of AI-generated moral reasoning and the degree to which people agree with those judgments. By constructing three corpora of human and AI responses to moral scenarios and running a large participant study, the authors reveal a nuanced, context-dependent alignment: humans favor non-moral judgments, prefer AI judgments in personal moral scenarios, and exhibit a robust anti-AI bias regarding source attribution. Linguistic cues and semantic patterns enable moderate detection of AI authorship, while predictive models show only modest accuracy in predicting provenance and agreement, indicating that detection signals are subtle and evanescent. The work highlights the complexity of human-AI interaction in morally charged contexts and underscores implications for using LLMs as decision support while guarding against bias and deception.

Abstract

As large language models (LLMs) become increasingly integrated into society, their alignment with human morals is crucial. To better understand this alignment, we created a large corpus of human- and LLM-generated responses to various moral scenarios. We found a misalignment between human and LLM moral assessments; although both LLMs and humans tended to reject morally complex utilitarian dilemmas, LLMs were more sensitive to personal framing. We then conducted a quantitative user study (N=230) in which participants evaluated these responses by judging whether they were AI-generated and rating their agreement with them. Human evaluators preferred LLMs' assessments in moral scenarios, though a systematic anti-AI bias was observed: participants were less likely to agree with judgments they believed to be machine-generated. Statistical and NLP-based analyses revealed subtle linguistic differences in responses, influencing detection and agreement. Overall, our findings highlight the complexities of human-AI perception in morally charged decision-making.

Paper Structure

This paper contains 37 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: A diagram showing the experimental design. The blue and orange boxes (Steps 1 and 3) correspond, respectively, to corpus generation and to the detection and agreement experiments (see Figure 2). The purple boxes (Steps 2, 4, 5, and 6) denote quantitative, statistical, and computational methods. The output of the experiment is an analysis of three corpora of human- and LLM-generated responses to various scenarios.
  • Figure 2: (A) Schematic of the interface and method used in the experiments that generated corpus 1 (human and dv2 responses) and corpus 2 (human and dv3 responses). (B) Schematic of the interface used in the corpus evaluation step (Step 3).
  • Figure 3: (A) Linguistic features found to differ between human- and LLM-generated responses (dv2: text-davinci-002; dv3: text-davinci-003; dv2h: humanized dv2). (B) Schematic of the prompting strategy used to generate the humanized LLM response by reducing its length and including typos.
  • Figure 4: (A) Examples of scenarios across the three categories (taken from Greene et al. 2004). (B) Endorsement of the different moral actions as a function of scenario category: ‘non moral’ refers to scenarios with no moral stakes; ‘impersonal moral’ refers to moral scenarios whose resolution does not involve the direct, personal involvement of the participant (emotionally non-engaging); ‘personal moral’ refers to moral scenarios whose resolution involves the direct involvement of the participant (emotionally engaging). Note that in the moral scenarios, what is asked is to judge the appropriateness of the utilitarian response. (C) Same as (B), but for the two LLMs considered (DV2: text-davinci-002; DV3: text-davinci-003).
  • Figure 5: (A) Probability of correctly detecting the source of the judgement (p(correct identification)), as a function of the scenario type in Corpus 1 (leftmost column; Exp. 1 DV2), Corpus 2 (central column; Exp. 2 DV3), and on average (rightmost column; Average). (B) Difference in agreement between the trials featuring human-generated items and those featuring LLM-generated items as a function of the scenario type. (C) Difference in agreement between the trials the participant declared as being human-generated and those declared to be LLM-generated (belief).
  • ...and 3 more figures
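
The three quantities described in the Figure 5 caption can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the record layout, the function name `summarize`, and the toy trials are all hypothetical, and the sign convention (human minus LLM) is an assumption, since the caption does not state the direction of the difference.

```python
from collections import defaultdict
from statistics import mean

# Toy, made-up trial records (NOT data from the study).
# Assumed layout: (scenario_type, true_source, believed_source, agreed)
trials = [
    ("personal moral", "human", "human", 1),
    ("personal moral", "llm",   "human", 1),
    ("personal moral", "llm",   "llm",   0),
    ("personal moral", "human", "llm",   0),
    ("non moral",      "human", "human", 1),
    ("non moral",      "llm",   "llm",   1),
]

def summarize(trials):
    """Per scenario type, compute detection accuracy and agreement gaps."""
    by_type = defaultdict(list)
    for row in trials:
        by_type[row[0]].append(row)

    summary = {}
    for scenario_type, rows in by_type.items():
        # Panel A analogue: share of trials where belief matches the true source.
        p_correct = mean(1 if true == belief else 0
                         for _, true, belief, _ in rows)

        def agreement_gap(column):
            # Mean agreement for "human" minus mean agreement for "llm",
            # grouping by the given column (1 = true source, 2 = belief).
            human = [r[3] for r in rows if r[column] == "human"]
            llm = [r[3] for r in rows if r[column] == "llm"]
            if not human or not llm:
                return None  # gap is undefined if one group is empty
            return mean(human) - mean(llm)

        summary[scenario_type] = {
            "p_correct": p_correct,
            "agreement_diff_by_source": agreement_gap(1),  # panel B analogue
            "agreement_diff_by_belief": agreement_gap(2),  # panel C analogue
        }
    return summary

for scenario_type, stats in summarize(trials).items():
    print(scenario_type, stats)
```

In the toy records, the "personal moral" trials show no agreement gap by true source but a positive gap by believed source, which is the kind of pattern the abstract describes as an anti-AI bias: agreement tracks what participants think the source is rather than what it actually is.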