Normative Evaluation of Large Language Models with Everyday Moral Dilemmas

Pratik S. Sachdeva, Tom van Nuenen

TL;DR

The paper probes how large language models encode and apply everyday moral norms by auditing seven LLMs against real-world dilemmas from r/AmItheAsshole and comparing their judgments and explanations to those of Redditors. It combines prompting, repeat evaluations, and thematic analysis (six moral themes) to reveal substantial inter-model disagreement, moderate to high self-consistency, and varying alignment with human judgments. The study further demonstrates that ensemble model verdicts can approximate Redditor consensus, even as individual models diverge, highlighting the potential and limits of LLMs for ethically sensitive applications. The findings underscore the need for robust, nuanced evaluation frameworks and caution in deploying LLMs in moral guidance roles such as therapists or companions, given biases and opaque reasoning patterns. Overall, the work provides a methodological blueprint for evaluating moral reasoning in unstructured, real-world data and emphasizes accountability and transparency in AI moral reasoning.

Abstract

The rapid adoption of large language models (LLMs) has spurred extensive research into their encoded moral norms and decision-making processes. Much of this research relies on prompting LLMs with survey-style questions to assess how well models are aligned with certain demographic groups, moral beliefs, or political ideologies. While informative, the adherence of these approaches to relatively superficial constructs tends to oversimplify the complexity and nuance underlying everyday moral dilemmas. We argue that auditing LLMs along more detailed axes of human interaction is of paramount importance to better assess the degree to which they may impact human beliefs and actions. To this end, we evaluate LLMs on complex, everyday moral dilemmas sourced from the "Am I the Asshole" (AITA) community on Reddit, where users seek moral judgments on everyday conflicts from other community members. We prompted seven LLMs to assign blame and provide explanations for over 10,000 AITA moral dilemmas. We then compared the LLMs' judgments and explanations to those of Redditors and to each other, aiming to uncover patterns in their moral reasoning. Our results demonstrate that large language models exhibit distinct patterns of moral judgment, varying substantially from human evaluations on the AITA subreddit. LLMs demonstrate moderate to high self-consistency but low inter-model agreement. Further analysis of model explanations reveals distinct patterns in how models invoke various moral principles. These findings highlight the complexity of implementing consistent moral reasoning in artificial systems and the need for careful evaluation of how different models approach ethical judgment. As LLMs continue to be used in roles requiring ethical decision-making such as therapists and companions, careful evaluation is crucial to mitigate potential biases and limitations.

Paper Structure

This paper contains 22 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Distributions of verdicts assigned by Redditors and LLMs to moral dilemmas. Bars represent the fraction of 10,826 submissions ($y$-axis) assigned each verdict ($x$-axis) by different models or Redditors (colors: legend). Error bars indicate bootstrapped 95% confidence intervals. For each verdict ($x$-axis ticks), colored bars appear in the order specified in the legend.
  • Figure 2: Large language models invoke distinct patterns of moral reasoning. Each point denotes the reasoning generated by a model, or by Redditors, for one moral dilemma. Colors denote the source of the reasoning (a specific model or Redditors). Each reasoning in the AITA dataset was converted to an embedding via RoBERTa-Large. The 1024-dimensional embeddings were then reduced to 2 dimensions using UMAP.
  • Figure 3: Moral reasoning corresponds with assignment of blame. Each subplot corresponds to one of six moral themes established by Yudkin et al. (2023). The $y$-axes denote prevalence difference, which is the percentage difference in the rate at which a given verdict (NTA and YTA: $x$-axes) is assigned, when the moral theme is present vs. when it is not. Larger, positive $y$-axis values denote that a given verdict is used more often when the moral theme is used in an LLM's reasoning. Each color denotes a different model (models appear in the same order on the $x$-axis as they do in the legend). Only NTA and YTA are shown on the $x$-axis for brevity. Note the $y$-axis ranges are not consistent across subplots. Error bars denote 95% bootstrapped confidence intervals.
  • Figure 4: Plurality vote on moral dilemmas is generally dictated by larger, proprietary models. For each submission, the plurality verdict was the verdict that received the most votes across the 7 models. Ties were allowed, so a submission can have multiple plurality verdicts. (a) The distribution of plurality verdict number, or the number of models participating in the plurality verdict. The $y$-axis is normalized relative to the total number of submissions. (b) The fraction of samples in which each model ($x$-axis) participates in the plurality vote.
  • Figure 5: Word Similarities of Moral Reasoning. Heatmap displays the average cosine similarity of TF-IDF representations of the reasons generated by each pair of models. Rows and columns correspond to different models. For each model, the 10,826 reasons were generated and converted into TF-IDF representations. The cosine similarity was calculated for all pairwise reasons, with the averages shown in the heatmap. Diagonal elements represent the average cosine similarity of reasons generated by replicates within the same model.
  • ...and 2 more figures
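
Two of the quantities described in the captions above, the plurality verdict with ties (Figure 4) and the prevalence difference (Figure 3), can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy data, and the reading of "percentage difference" as a percentage-point difference between two rates are all assumptions made for illustration, and the verdict labels other than NTA and YTA are placeholders.

```python
from collections import Counter


def plurality_verdicts(votes):
    """Return (set of verdicts tied for the most votes, that vote count).

    `votes` holds one verdict per model for a single submission.
    Ties are kept, mirroring "multiple plurality verdicts were allowed".
    """
    counts = Counter(votes)
    top = max(counts.values())
    return {verdict for verdict, n in counts.items() if n == top}, top


def prevalence_difference(records, verdict):
    """Percentage-point gap in how often `verdict` is assigned when a
    moral theme is present vs. absent (assumed interpretation).

    `records` is a list of (verdict, theme_present) pairs.
    """
    with_theme = [v for v, present in records if present]
    without_theme = [v for v, present in records if not present]
    if not with_theme or not without_theme:
        raise ValueError("need records both with and without the theme")
    rate_with = sum(v == verdict for v in with_theme) / len(with_theme)
    rate_without = sum(v == verdict for v in without_theme) / len(without_theme)
    return 100.0 * (rate_with - rate_without)


# Hypothetical votes from 7 models on one submission.
clear_case = ["NTA", "NTA", "YTA", "YTA", "ESH", "NTA", "NAH"]
tied_case = ["NTA", "YTA", "NTA", "YTA", "ESH", "NAH", "INFO"]

# Hypothetical (verdict, theme_present) pairs for one model and one theme.
toy_records = [
    ("NTA", True), ("NTA", True), ("YTA", True), ("NTA", True),
    ("NTA", False), ("YTA", False), ("YTA", False), ("YTA", False),
]

print(plurality_verdicts(clear_case))
print(plurality_verdicts(tied_case))
print(prevalence_difference(toy_records, "NTA"))
```

In this sketch the second element returned by `plurality_verdicts` plays the role of the "plurality verdict number" in Figure 4(a), and a model participates in the plurality vote for a submission whenever its own verdict is in the returned set.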