Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

Seong Hah Cho; Junyi Li; Anna Leshinskaya

Abstract

Value alignment of Large Language Models (LLMs) requires us to empirically measure these models' actual, acquired representation of value. One characteristic of value representation in humans is that it distinguishes among different kinds of value. We investigate whether LLMs likewise distinguish three different kinds of good: moral, grammatical, and economic. By probing model behavior, embeddings, and residual stream activations, we report pervasive cases of value entanglement: a conflation between these distinct representations of value. Specifically, both grammatical and economic valuations were found to be overly influenced by moral value, relative to human norms. This conflation was repaired by selective ablation of the activation vectors associated with morality.
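As a rough illustration of the two operations the abstract names (projecting activations onto an attribute vector and ablating that vector), the sketch below uses synthetic data. It is not the authors' implementation: the difference-of-means estimate of the direction, the function names, and the toy activations are all assumptions made for the example.

```python
import numpy as np

def attribute_direction(acts_high, acts_low):
    """Unit vector pointing from the mean low-attribute activation to the
    mean high-attribute activation (difference of means; an assumption)."""
    diff = acts_high.mean(axis=0) - acts_low.mean(axis=0)
    return diff / np.linalg.norm(diff)

def project(acts, direction):
    """Scalar projection of each activation row onto a unit direction."""
    return acts @ direction

def ablate(acts, direction):
    """Remove the component of each activation row along a unit direction."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-ins for residual stream activations (hypothetical data).
rng = np.random.default_rng(0)
d_model = 16
moral_good = rng.normal(size=(20, d_model)) + 1.0
moral_bad = rng.normal(size=(20, d_model)) - 1.0

morality_dir = attribute_direction(moral_good, moral_bad)
acts = np.vstack([moral_good, moral_bad])
ablated = ablate(acts, morality_dir)
```

After ablation, every row of `ablated` has zero projection onto `morality_dir`, while its components orthogonal to that direction are unchanged. In a real model the same subtraction would be applied to the residual stream at chosen layers during inference.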

Paper Structure (38 sections, 4 equations, 10 figures, 2 tables)

Introduction
Methods
Stimuli: MoralGrammar68 and MoralEconomic68 Sentences
Closed Source Model Embeddings
Residual Stream Activations
Directional Ablation
Results
Model and Human Behavioral Ratings
Embedding Model Analyses
Residual Stream Activation Projections
Inference Time Interventions
Influence of Pre-training and Post-training on Entanglement
Discussion
Task Instructions and Model Prompts
Human instructions
...and 23 more sections

Figures (10)

Figure 1: Model ratings of sentences from MoralGrammar68 (top) and MoralEconomic68 (bottom) for closed-source (GPT-3.5, GPT-4o mini) and open-source (Qwen2.5 7B, Gemma-2 9B, Mistral-Small 24B) models. Center colors indicate morally good (blue), neutral (white), and morally bad (red) scenarios. Edge colors indicate groups of stimuli varying across grammar or economic scales for a single moral scenario. Shapes and their number of sides indicate the grammar or economic gradient (MoralGrammar68: triangle (Level 1: 0 errors) to circle (Level 4: 4+ errors); MoralEconomic68: triangle (Level 1: $) to circle (Level 4: $$$$)). See \ref{['figure_s11']} for the expanded legend of the individual dots.
Figure 2: Residual stream activation projections onto the grammar attribute vector for the MoralGrammar68 stimulus set (left) and onto the economic attribute vector for the MoralEconomic68 stimulus set (right), as a function of 3 morality levels, in GPT-3.5, Qwen2.5 7B, Gemma-2 9B, and Mistral-Small 24B. Error bars show mean (center line) ± SEM.
Figure 3: Changes in the correlation between human and model ratings, and between object price and model ratings, during the double ablation intervention using a morality vector (left), grammar vector (middle), and economic vector (right). Asterisks indicate layers where the correlation changes significantly compared to baseline and control questions (Animal Size).
Figure 4: Comparison of Qwen2.5 7B residual stream activation projections between the pre-trained-only and instruction-tuned models on MoralGrammar68 (left) and MoralEconomic68 (right). The presence of a marker (square; circle) indicates that the cross-domain correlation (e.g. morality versus grammar projection values) for the corresponding layer is statistically different from 0. Gray shading indicates that the cross-domain correlation is statistically different between the two model types.
Figure S3: Expanded legend for \ref{['figure_1']} and \ref{['figure_s4']}.
...and 5 more figures
