Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability

Fan Huang; Haewoon Kwak; Jisun An

Abstract

Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce \textit{moral reasoning trajectories}, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4--57.7\% of consecutive steps involve framework switches, and only 16.4--17.8\% of trajectories remain framework-consistent. Unstable trajectories remain 1.29$\times$ more susceptible to persuasive attacks ($p=0.015$). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8--22.6\% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7--8.9\% drift reduction) and amplifies the stability--accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly ($r=0.715$, $p<0.0001$) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity $= 0.859$).
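The switch and consistency statistics quoted above can be illustrated with a minimal sketch. It assumes each reasoning step has already been reduced to a single dominant framework label; the function names `switch_rate` and `is_framework_consistent` and the example trajectory are hypothetical, and the paper's exact definitions may differ.

```python
from typing import Sequence


def switch_rate(trajectory: Sequence[str]) -> float:
    """Fraction of consecutive step pairs whose framework label differs.

    Assumption: one dominant framework label per reasoning step.
    """
    if len(trajectory) < 2:
        return 0.0  # no consecutive pairs, so no switches can occur
    switches = sum(a != b for a, b in zip(trajectory, trajectory[1:]))
    return switches / (len(trajectory) - 1)


def is_framework_consistent(trajectory: Sequence[str]) -> bool:
    """True when every step invokes the same framework."""
    return len(set(trajectory)) <= 1


# Hypothetical four-step trajectory, for illustration only.
example = ["deontology", "utilitarianism", "utilitarianism", "virtue_ethics"]
rate = switch_rate(example)  # two switches over three consecutive pairs
consistent = is_framework_consistent(example)
```

Averaging `switch_rate` over all trajectories would give a corpus-level switch percentage, and the share of trajectories for which `is_framework_consistent` holds would give a consistency rate, under the labeling assumption stated above.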

Paper Structure (199 sections, 6 equations, 23 figures, 27 tables)

Introduction
Related Work
Foundations of Morality Frameworks
Kantian Deontology
Benthamite Act Utilitarianism
Aristotelian Virtue Ethics
Scanlonian Contractualism
Gauthierian Contractarianism
LLM Moral Reasoning Evaluation
Probing and Mechanistic Interpretability
Experimental Design
Datasets
Prompting Methodology
Scoring Model Selection
Foundational Experiment: Structure $\times$ Framework Constraint
...and 184 more sections

Figures (23)

Figure 1: Framework attribution trajectories across reasoning steps (shaded regions indicate 95% confidence intervals). Sample sizes: GPT-5 $n=1{,}199$, Llama-3.3-70B $n=1{,}200$, Qwen2.5-72B $n=1{,}197$, out of 1,200 requested per model; shortfalls are due to API or JSON parsing failures. All models show increased Utilitarianism at Step 3; model-specific patterns emerge elsewhere. Contractarianism (bottom lines) is consistently underrepresented.
Figure 2: Classification accuracy by binned FDR for each model, before (solid) and after (dashed) activation steering. Error bars indicate 95% CIs.
Figure 3: Layer-wise probe performance predicting 5D moral framework distributions. Stars mark optimal layers. Llama peaks late (layer 63 of 81, about 78% of model depth); Qwen peaks early (layer 17 of 81, about 21% of model depth). Dashed lines show baselines.
Figure 4: Moral reasoning trajectories across six LLMs and three ethical datasets. Each subplot visualizes the step-level ethical soundness score progression (0--100%) through sequential reasoning steps for a specific model. Individual trajectories (thin lines with transparency) represent single moral scenarios, while bold lines show dataset-averaged patterns. Three complementary datasets are shown: Moral Stories (narrative-based contrastive moral reasoning), ETHICS (binary ethical judgments across five moral frameworks), and Social Chemistry 101 (social norm evaluation grounded in Moral Foundations Theory). Diamond markers indicate final moral judgments. Key observations include: all models maintain relatively high soundness scores (70.0--90.0%) across reasoning steps; GPT-5 exhibits the longest trajectories with 5--8 reasoning steps while o3-mini produces more concise chains; Moral Stories consistently elicits the highest soundness scores across all models; and ETHICS scenarios show more variance in earlier reasoning steps, reflecting morally ambiguous cases across ethical frameworks.
Figure 5: Ethical framework distribution across six pilot models. Contractualism and Deontology dominate; Virtue Ethics is underrepresented.
...and 18 more figures
