Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding

Maciej Skorski, Alina Landowska

TL;DR

The paper examines how Large Language Models understand moral values by introducing a Bayesian annotation framework that explicitly models annotator disagreement to separate aleatoric and epistemic uncertainty. It evaluates three market-leading LLMs (Claude Sonnet 4, DeepSeek-V3, and Llama 4 Maverick) on 100K+ texts and 250K+ moral annotations, using a GPU-optimized Dawid-Skene-style approach to infer ground-truth labels. Across multiple datasets, the AI systems typically rank in the top quartile of human annotators and achieve substantially lower false negative rates than humans, with a modest uptick in false positives. This uncertainty-aware evaluation offers a scalable, robust method for assessing moral foundation detection in LLMs. It also highlights practical considerations for deployment, including calibration and cross-domain generalizability, and acknowledges limitations such as distribution shift and prompt sensitivity.

Abstract

How do Large Language Models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluated the best language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from nearly 700 annotators on 100K+ texts spanning social networks, news, and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators, achieving balanced accuracy well above the human average. Importantly, we find that AI models produce far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.
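To make the idea of modeling annotator disagreement concrete, the following is a minimal sketch of a classic binary Dawid-Skene estimator fitted with EM in NumPy. It is not the paper's implementation: the paper describes a GPU-optimized Bayesian framework, whereas this sketch uses point estimates with add-one smoothing. The function name `dawid_skene_binary`, the `-1` encoding for missing votes, and the per-annotator sensitivity/specificity parameterization are all assumptions made for illustration.

```python
import numpy as np

def dawid_skene_binary(labels, n_iter=50):
    """Illustrative binary Dawid-Skene EM (hypothetical helper, not the paper's code).

    labels: (n_items, n_annotators) array with entries 1, 0, or -1 (no vote).
    Returns the posterior P(true label = 1) per item, plus each annotator's
    estimated sensitivity and specificity.
    """
    labels = np.asarray(labels)
    pos = (labels == 1).astype(float)   # annotator voted "present"
    neg = (labels == 0).astype(float)   # annotator voted "absent"
    observed = pos + neg                # 1 where a vote exists

    # Initialise the posterior with each item's fraction of positive votes.
    post = pos.sum(axis=1) / np.maximum(observed.sum(axis=1), 1.0)

    for _ in range(n_iter):
        # M-step: smoothed estimates of prevalence and annotator reliability.
        prevalence = (post.sum() + 1.0) / (len(post) + 2.0)
        sens = (post @ pos + 1.0) / (post @ observed + 2.0)              # P(vote=1 | true=1)
        spec = ((1.0 - post) @ neg + 1.0) / ((1.0 - post) @ observed + 2.0)  # P(vote=0 | true=0)

        # E-step: posterior over the latent true label of each item.
        log_p1 = np.log(prevalence) + pos @ np.log(sens) + neg @ np.log(1.0 - sens)
        log_p0 = np.log(1.0 - prevalence) + pos @ np.log(1.0 - spec) + neg @ np.log(spec)
        post = 1.0 / (1.0 + np.exp(log_p0 - log_p1))

    return post, sens, spec
```

Unlike majority voting, the posterior weights each vote by the annotator's estimated reliability, so an item's label remains a probability rather than a hard decision. A fully Bayesian treatment would place priors on these parameters and report posterior uncertainty instead of point estimates.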

Paper Structure

This paper contains 25 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Graphical model representation of the model for multi-annotator classification. Light gray circles represent latent variables, dark gray rectangles represent observed variables, white circles represent parameters, and blue rectangles represent hyperparameters. Plates indicate replication over items ($N$) and annotators ($J$).
  • Figure 2: DeepSeek-V3 vs human accuracy (MFTC).
  • Figure 3: Claude Sonnet 4 vs human accuracy (MFRC).
  • Figure 4: Llama 4 Maverick vs humans (eMFD).
  • Figure 5: Error trade-offs in moral foundation detection. AI models (shapes) vs human baselines (circles) across datasets with colors denoting moral foundations. Diagonal lines indicate error balance (FPR = FNR).
  • ...and 1 more figures
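The quantities compared in the figures (false positive rate, false negative rate, and balanced accuracy) can be computed as in the sketch below. It assumes hard binary labels for a single moral foundation, whereas the paper infers ground truth probabilistically; the function name `error_rates` and the sample data are hypothetical.

```python
import numpy as np

def error_rates(y_true, y_pred):
    """Return FPR, FNR, and balanced accuracy for binary labels (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    tp = np.sum(y_true & y_pred)     # foundation present and detected
    tn = np.sum(~y_true & ~y_pred)   # foundation absent and not flagged
    fp = np.sum(~y_true & y_pred)    # foundation absent but flagged
    fn = np.sum(y_true & ~y_pred)    # foundation present but missed

    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    # Balanced accuracy is the mean of the true positive and true negative rates.
    balanced_accuracy = 1.0 - (fpr + fnr) / 2.0
    return {"fpr": fpr, "fnr": fnr, "balanced_accuracy": balanced_accuracy}

# Made-up labels: 4 texts with the foundation present, 6 without.
rates = error_rates([1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
print(rates)
```

A rater on the FPR = FNR diagonal makes both kinds of error at the same rate. A rater with a lower FNR but a higher FPR detects more of the moral content that is present, at the cost of flagging more texts that lack it.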