Are Language Models Sensitive to Morally Irrelevant Distractors?

Andrew Shaw; Christina Hahn; Catherine Rasgaitis; Yash Mishra; Alisa Liu; Natasha Jaques; Yulia Tsvetkov; Amy X. Zhang

TL;DR

The paper investigates whether LLMs exhibit context-sensitive moral judgments by introducing a multimodal set of 60 morally irrelevant distractors and applying them to two moral benchmarks, MoralChoice and r/AITA, across four model families. It demonstrates that negative distractors can shift judgments by over 30%, with reasoning and ablations modulating the effect, and that these biases persist across textual and visual modalities. The findings align with situationist ideas and highlight the need for contextual AI alignment, safety testing under varied distractors, and shifting responsibility toward interface designers and deployment contexts. The work emphasizes that LLMs are not universal moral reasoners but context-responsive tools whose outputs can be steered by incidental affect and prompts, urging more nuanced evaluation in real-world, distractor-rich settings.

Abstract

With the rapid development and uptake of large language models (LLMs) across high-stakes settings, it is increasingly important to ensure that LLMs behave in ways that align with human values. Existing moral benchmarks prompt LLMs with value statements, moral scenarios, or psychological questionnaires, with the implicit underlying assumption that LLMs report somewhat stable moral preferences. However, moral psychology research has shown that human moral judgements are sensitive to morally irrelevant situational factors, such as smelling cinnamon rolls or the level of ambient noise, thereby challenging moral theories that assume the stability of human moral judgements. Here, we draw inspiration from this "situationist" view of moral psychology to evaluate whether LLMs exhibit similar cognitive moral biases to humans. We curate a novel multimodal dataset of 60 "moral distractors" from existing psychological datasets of emotionally-valenced images and narratives which have no moral relevance to the situation presented. After injecting these distractors into existing moral benchmarks to measure their effects on LLM responses, we find that moral distractors can shift the moral judgements of LLMs by over 30% even in low-ambiguity scenarios, highlighting the need for more contextual moral evaluations and more nuanced cognitive moral modeling of LLMs.

Paper Structure (40 sections, 2 equations, 12 figures, 8 tables)

Introduction
Related Work
Benchmarking the Morality of LLMs
Cognitive Biases in LLMs
The Person-Situationism Debate
Methods
Moral Distractors
Textual Distractors
Visual Distractors
Moral Benchmarks
MoralChoice
r/AITA
Experimental Setup
Results
MoralChoice Results
...and 25 more sections

Figures (12)

Figure 1: Example of an immoral response induced by a negative visual distractor (gemma-3-4b-it).
Figure 2: In high-ambiguity scenarios with textual distractors, positive distractors increase and negative distractors decrease the marginal probability of a moral action (MMAP) relative to the no-distractor baseline.
Figure 3: In low-ambiguity scenarios with textual distractors, negative distractors decrease the marginal probability of a moral action (MMAP) relative to the no-distractor baseline, by over 30% in some cases, while positive distractors have less of an effect due to prior model alignment.
Figure 4: Example of an immoral response induced by a negative textual distractor (Llama-3.2-3B-Instruct).
Figure 5: Across high- and low-ambiguity scenarios, visual distractors induce similar effects to textual distractors for gemma-3-4b-it.
...and 7 more figures
