The Fragility Of Moral Judgment In Large Language Models

Tom van Nuenen, Pratik S. Sachdeva

TL;DR

The results show that LLM moral judgments are co-produced by narrative form and task scaffolding, raising reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance.

Abstract

People increasingly use large language models (LLMs) for everyday moral and interpersonal guidance, yet these systems cannot interrogate missing context and instead judge dilemmas as presented. We introduce a perturbation framework for testing the stability and manipulability of LLM moral judgments while holding the underlying moral conflict constant. Using 2,939 dilemmas from r/AmItheAsshole (January-March 2025), we generate three families of content perturbations: surface edits (lexical/structural noise), point-of-view shifts (voice and stance neutralization), and persuasion cues (self-positioning, social proof, pattern admissions, victim framing). We also vary the evaluation protocol (output ordering, instruction placement, and unstructured prompting). We evaluate all variants with four models (GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, Qwen2.5-72B), yielding N=129,156 judgments. Surface perturbations produce low flip rates (7.5%), largely within the self-consistency noise floor (4-13%), whereas point-of-view shifts induce substantially higher instability (24.3%). A large subset of dilemmas (37.9%) is robust to surface noise yet flips under perspective changes, indicating that models condition on narrative voice as a pragmatic cue. Instability concentrates in morally ambiguous cases; scenarios where no party is assigned blame are most susceptible. Persuasion perturbations yield systematic directional shifts. Protocol choices dominate all other factors: agreement between structured protocols is only 67.6% (kappa=0.55), and only 35.7% of model-scenario units match across all three protocols. These results show that LLM moral judgments are co-produced by narrative form and task scaffolding, raising reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance.
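
The abstract reports two kinds of summary statistics: flip rates (how often a verdict changes under perturbation) and chance-corrected agreement between protocols (kappa). The sketch below illustrates how such quantities could be computed from paired verdict lists. It assumes that kappa refers to Cohen's kappa for two raters, which the abstract does not state, and the function names and toy verdicts are hypothetical rather than taken from the paper.

```python
from collections import Counter


def flip_rate(baseline, perturbed):
    """Share of paired verdicts that differ between baseline and perturbed runs."""
    if len(baseline) != len(perturbed) or not baseline:
        raise ValueError("need two equal-length, non-empty verdict lists")
    flips = sum(b != p for b, p in zip(baseline, perturbed))
    return flips / len(baseline)


def cohens_kappa(a, b):
    """Cohen's kappa (assumed variant): agreement corrected for chance."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty verdict lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each label's marginal frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists use one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data only; the labels follow the verdict categories named in the figures.
baseline = ["Self At Fault", "Other At Fault", "Other At Fault", "No One At Fault"]
perturbed = ["Self At Fault", "Other At Fault", "Self At Fault", "Other At Fault"]
print(flip_rate(baseline, perturbed))
print(cohens_kappa(baseline, perturbed))
```

In this toy case half of the verdicts flip, while kappa is much lower than the raw 50% agreement because part of that agreement is expected by chance. This is the same reason the abstract reports kappa alongside the 67.6% raw agreement figure.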

Paper Structure (71 sections, 4 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Baseline Uncertainty Predicts Perturbation Instability. (a) Relationship between baseline normalized entropy (NE) and perturbation flip rate across four models. NE is computed from runs 2-15 using a split-sample approach; flip rates are computed against held-out run 1 to avoid leakage. Lines show quasibinomial GLM fits (logit link). (b) Flip rates by baseline verdict category, aggregated across all models.
  • Figure 2: Verdict Instability and Asymmetric Blame Attribution by Model. (a) Flip rates by base verdict category and perturbation family. (b) Net blame direction by model and perturbation type. Values represent $(\text{flips toward blaming narrator} - \text{flips toward exonerating narrator}) / (\text{total directional flips})$, ranging from $-1$ (all flips exonerate narrator) to $+1$ (all flips blame narrator), with $0$ indicating balanced effects. Red cells indicate perturbations that shift blame toward the narrator; blue cells indicate shifts toward exonerating the narrator.
  • Figure 3: Epistemic Stance and Cross-Model Agreement Under Verification. (a) Change in net epistemic stance (boosters $-$ hedges per 100 words) between baseline and perturbed explanations. Negative values indicate more hedged, tentative language; positive values indicate more confident, direct language. Error bars show 95% confidence intervals. (b) Cross-model agreement by scenario-level verification intensity on reasoning traces. Scenarios are binned by total verification count across all models and protocols.
  • Figure 4: Protocol Instability and Distributed Blame Verdict Fate. (a) Verdict flip rates under three protocol perturbations (Explanation First, System Prompt, Unstructured) compared to the main study's verdict-first protocol, grouped by base verdict category. (b) Fate of distributed blame verdicts (All At Fault or No One At Fault) under content and protocol perturbations. Stacked bars show the percentage of verdicts that were retained versus shifted to clear blame attribution (Other At Fault or Self At Fault).
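
The captions for Figures 1 and 2 rely on two derived quantities: baseline normalized entropy (NE) and net blame direction. A minimal sketch of both follows. The net blame formula is the one given in the Figure 2 caption. The NE definition (Shannon entropy of the verdict distribution across repeated runs, divided by the log of the number of verdict categories) is an assumption, since the captions do not spell it out. The function names, the choice of four categories, and the zero returned when there are no directional flips are likewise illustrative.

```python
import math
from collections import Counter


def normalized_entropy(verdicts, num_categories):
    """Assumed NE: Shannon entropy of the verdicts divided by log(num_categories)."""
    if not verdicts or num_categories < 2:
        raise ValueError("need at least one verdict and two categories")
    n = len(verdicts)
    probs = [count / n for count in Counter(verdicts).values()]
    entropy = sum(-p * math.log(p) for p in probs)
    return entropy / math.log(num_categories)


def net_blame_direction(toward_blame, toward_exonerate):
    """(blaming flips - exonerating flips) / total directional flips, in [-1, 1]."""
    total = toward_blame + toward_exonerate
    if total == 0:
        return 0.0  # assumption: no directional flips is treated as balanced
    return (toward_blame - toward_exonerate) / total


# Toy data: 14 repeated runs (mirroring runs 2-15) split evenly over two verdicts.
runs = ["Self At Fault"] * 7 + ["Other At Fault"] * 7
print(normalized_entropy(runs, num_categories=4))
print(net_blame_direction(toward_blame=3, toward_exonerate=1))
```

Under these assumptions NE is 0 when every run returns the same verdict and 1 when runs are spread uniformly over all categories, so the even two-way split above lands halfway. A net blame value of 0.5 corresponds to three flips toward blaming the narrator for every one toward exonerating them.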