Moral Preferences of LLMs Under Directed Contextual Influence

Phil Blandfort; Tushar Karayil; Urja Pawar; Robert Graham; Alex McKenzie; Dmitrii Krasheninnikov

TL;DR

We find that contextual influences often shift decisions significantly, even when only superficially relevant, and that baseline preferences are a poor predictor of directional steerability: models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence.

Abstract

Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals, such as user requests or cues about social norms, that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings. We introduce a pilot evaluation harness for this setting: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that: (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences can backfire: models may explicitly claim neutrality or discount the contextual cue, yet their choices still shift, sometimes in the opposite direction; and (iv) reasoning reduces average sensitivity but amplifies the effect of biased few-shot examples. Our findings motivate extending moral evaluations with controlled, direction-flipped context manipulations to better characterize model behavior.

Paper Structure (83 sections, 5 equations, 19 figures, 11 tables)

Introduction
Setup
Moral Triage Task
Prompt Template.
Demographic Factors and Group Sizes.
Directed Contextual Influences
Experimental Conditions
Methods
Models
Sampling Procedure
Steerability Metrics
Counts, Frequencies, and Odds.
Influence Effect.
Steerability.
Steerability Asymmetry.
...and 68 more sections

Figures (19)

Figure 1: An example of contextual influence for the factor "young-vs-old". Given the choice between saving 5 young or 6 old people, DeepSeek V3.2 (with reasoning) defaults to saving the larger group (the old). Influencing it to favour the young succeeds 5/8 times; however, pushing it toward saving the old backfires, and the model saves the young more frequently (6/8). This illustrates asymmetric steerability that is invisible in context-free evaluation.
Figure 2: Preference shifts under contextual influence for poor-vs-rich, for all models (reasoning disabled). X-axis shows changes in log-odds of choosing B. Gray line at 0 is the baseline; actual baseline frequency of choosing B is shown in green on the right. Red shows effect of influencing toward A; blue shows nudging toward B. Effective influences push red leftward and blue rightward. Steerability s(B) measures blue's rightward shift from baseline; s(A) measures red's leftward shift. Negative values (e.g. blue shifting leftward for Llama and Qwen) indicate backfiring. Steerability asymmetry is when blue shifts further right than red shifts left.
Figure 3: Preference shifts under contextual influence of selected models, for all factors. X-axis shows changes in log-odds of choosing B. Gray line at 0 is the baseline; actual baseline frequency of choosing B is shown in green on the right. Red shows effect of influencing toward A; blue shows nudging toward B. Effective influences push red leftward and blue rightward. Steerability s(B) measures blue's rightward shift from baseline; s(A) measures red's leftward shift.
Figure 4: Steerability magnitude by influence type, split by reasoning condition. Steerability measures the change in log-odds of choosing the targeted option when contextual influence is applied. Reasoning reduces steerability overall and shifts which influences are most effective: emotional appeals and user preferences dominate without reasoning; few-shot examples dominate with reasoning.
Figure 5: Backfiring rates for different types of influence by reasoning condition. Rates are calculated as percentages of the cases in which the influence is statistically significant. For example, a rate of 20% means that, among the cases where the contextual influence in the respective condition causes a significant preference shift, 20% shift in the direction opposite to the influence.
...and 14 more figures
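
The captions for Figures 2 to 4 describe steerability as a change in the log-odds of choosing option B relative to baseline. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the paper's implementation: the function names, the additive smoothing constant, the definition of asymmetry as s(B) minus s(A), and the example counts are all hypothetical, and the sketch does not test for statistical significance.

```python
import math


def log_odds(count_b: int, total: int, smoothing: float = 0.5) -> float:
    """Smoothed log-odds of choosing option B.

    The smoothing constant is an assumption made here so that 0/8 or 8/8
    outcomes stay finite; the paper may handle such cases differently.
    """
    return math.log((count_b + smoothing) / (total - count_b + smoothing))


def steerability(baseline, toward_a, toward_b):
    """Compute s(A), s(B) and their difference from (count_b, total) pairs.

    s(B): rightward shift in log-odds of B under influence toward B.
    s(A): leftward shift in log-odds of B under influence toward A.
    A negative value means the influence backfired.
    """
    base = log_odds(*baseline)
    s_a = base - log_odds(*toward_a)
    s_b = log_odds(*toward_b) - base
    # Asymmetry as a simple difference is an assumption of this sketch.
    return {"s_A": s_a, "s_B": s_b, "asymmetry": s_b - s_a}


# Hypothetical counts of "chose B" out of 8 samples per condition.
result = steerability(baseline=(4, 8), toward_a=(2, 8), toward_b=(3, 8))
for name, value in result.items():
    print(f"{name}: {value:+.3f}")
```

With these illustrative counts, influence toward A works as intended (positive s(A)), while influence toward B lowers the rate of choosing B (negative s(B)), which is the backfiring pattern the captions describe.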
