
How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models

Hiroki Fukui

Abstract

Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study ($\mathrm{BF}_{10} > 10$ for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics -- Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI) -- revealed four distinct ethical processing types: Output Filter (GPT; safe outputs, no processing), Defensive Repetition (Llama; high consistency through formulaic repetition), Critical Internalization (Qwen; deep deliberation, incomplete integration), and Principled Consistency (Sonnet; deliberation, consistency, and other-recognition co-occurring). The central finding is an interaction between processing capacity and instruction format: in low-DD models, instruction format has no effect on internal processing; in high-DD models, reasoned norms and virtue framing produce opposite effects. Lexical compliance with ethical instructions did not correlate with any processing metric at the cell level ($r = -0.161$ to $+0.256$, all $p > .22$; $N = 24$; power limited), suggesting that safety, compliance, and ethical processing are largely dissociable. These processing types show structural correspondence to patterns observed in clinical offender treatment, where formal compliance without internal processing is a recognized risk signal.


Paper Structure

This paper contains 54 sections, 2 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: DD $\times$ VCAD scatter plot for four models (G0--G3 pooled). Each point represents a model--language combination. The four quadrants correspond to the typology in Table \ref{tab:typology}.
  • Figure 2: ORI subtype composition by model and language. Subtypes: instrumental (referencing others as means), interior (referencing others' internal states), contextual (referencing others' situational details).
  • Figure 3: DI by condition (G0--G3) for each model--language combination. Error bars indicate $\pm 1$ SD.
  • Figure 4: DI variation across conditions as a function of DD. Low-DD models (GPT, Llama) show minimal condition sensitivity; high-DD models (Qwen, Sonnet) show large condition-dependent variation.
  • Figure 5: Forest plot of all 24 confirmatory tests (3 hypotheses $\times$ 4 models $\times$ 2 languages). Effect sizes are Hedges' $g$ with 95% confidence intervals. Filled markers indicate $\mathrm{BF}_{10} > 10$ (strong evidence).
  • ...and 2 more figures