Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges

Weilun Xu, Alexander Rusnak, Frederic Kaplan

Abstract

When large language models make ethical judgments, do their internal representations distinguish between normative frameworks, or collapse ethics into a single acceptability dimension? We probe hidden representations across five ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) in six LLMs spanning 4B--72B parameters. Our analysis reveals differentiated ethical subspaces with asymmetric transfer patterns -- e.g., deontology probes partially generalize to virtue scenarios while commonsense probes fail catastrophically on justice. Disagreement between deontological and utilitarian probes correlates with higher behavioral entropy across architectures, though this relationship may partly reflect shared sensitivity to scenario difficulty. Post-hoc validation reveals that probes partially depend on surface features of benchmark templates, motivating cautious interpretation. We discuss both the structural insights these methods provide and their epistemological limitations.

Paper Structure

This paper contains 64 sections, 2 equations, 14 figures, and 1 table.

Figures (14)

  • Figure 1: UMAP projection of activation patterns at layer 20 in Mistral-7B-Instruct. This visualization illustrates the differentiated yet entangled nature of ethical representations: partial overlap among action-oriented frameworks (red, blue, green) versus separation for character-focused virtue ethics (orange). Note that UMAP projections are for qualitative illustration only; quantitative claims about framework relationships rest on the cross-framework transfer matrices (Figure \ref{fig:cross_framework}) and probe accuracy metrics, not visual clustering patterns. The structural entanglement suggested here is validated formally across scales (4B--72B) in Appendix \ref{app:entanglement}.
  • Figure 2: Layer-wise emergence and cross-framework structure in Llama-3.3-70B-Instruct. (a) Framework-specific trajectories reveal a hierarchy of learnability. (b) Asymmetric transfer patterns confirm distributed, non-modular encoding.
  • Figure 3: Cross-framework probe analysis for Llama-3.3-70B-Instruct. (a) Accuracy matrices reveal asymmetric transfer. (b) Confidence heatmaps show pervasive overconfidence. (c) ECE confirms severe probe miscalibration on off-diagonal transfers, indicating that representations for different frameworks share structural features that trigger confident (but incorrect) predictions.
  • Figure 4: Probe conflict and behavioral inconsistency in Mistral-7B-Instruct. (Top) The conflict distribution shows a populated high-conflict tail, indicating the model preserves distinct deontological and utilitarian representations that can effectively "disagree." (Bottom) High-conflict scenarios (red) correlate with elevated choice entropy ($r = 0.362$, $p = 0.004$). The relationship may partly reflect shared sensitivity to scenario difficulty (see Limitations).
  • Figure 5: Layer-wise probe accuracy for Gemma-3-4B-Base across 34 layers. Framework-specific trajectories emerge despite limited capacity, with virtue ethics reaching an early plateau while utilitarianism converges later. The base model exhibits lower absolute performance but preserves the relative ordering of frameworks.
  • ...and 9 more figures
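
The Figure 3 caption reports expected calibration error (ECE) for probe predictions on cross-framework transfers. The following is a minimal sketch of the standard equal-width-bin ECE, not the authors' implementation: the function name, the choice of 10 bins, and the input format (one confidence and one correctness flag per scenario) are all assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count; the paper's setting is not stated here
) -> float:
    """Equal-width-bin ECE: sum over bins of (n_b / N) * |accuracy_b - mean_confidence_b|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # summed confidence per bin
    hit_sum = [0] * n_bins     # number of correct predictions per bin
    count = [0] * n_bins       # number of predictions per bin

    for conf, hit in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += int(hit)
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hit_sum[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece
```

As a usage illustration with made-up inputs, a probe that assigns confidence 0.9 to four transfer scenarios but is right on only one of them gets an ECE of approximately 0.65, the kind of confident-but-incorrect pattern the caption describes for off-diagonal transfers.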