Visual Distraction Undermines Moral Reasoning in Vision-Language Models

Xinyi Yang; Chenheng Xu; Weijun Hong; Ce Mo; Qian Wang; Fang Fang; Yixin Zhu

Abstract

Moral reasoning is fundamental to safe Artificial Intelligence (AI), and ensuring its consistency across modalities becomes critical as AI systems evolve from text-based assistants to embodied agents. Current safety techniques demonstrate success in textual contexts, but concerns remain about generalization to visual inputs. Existing moral evaluation benchmarks rely on text-only formats and lack systematic control over variables that influence moral decision-making. Here we show that visual inputs fundamentally alter moral decision-making in state-of-the-art (SOTA) Vision-Language Models (VLMs), bypassing text-based safety mechanisms. We introduce Moral Dilemma Simulation (MDS), a multimodal benchmark grounded in Moral Foundation Theory (MFT) that enables mechanistic analysis through orthogonal manipulation of visual and contextual variables. The evaluation reveals that the vision modality activates intuition-like pathways that override the more deliberate and safer reasoning patterns observed in text-only contexts. These findings expose critical fragilities where language-tuned safety filters fail to constrain visual processing, demonstrating the urgent need for multimodal safety alignment.
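To make "orthogonal manipulation" concrete, the following is a minimal sketch of a full factorial design over the three conceptual variables named in the Figure 2 caption (personal force, intention of harm, self-benefit). The binary levels, the `FACTORS` dictionary, and the `full_factorial` helper are illustrative assumptions, not the authors' generation engine; the actual variable levels are not given in this excerpt.

```python
from itertools import product

# Hypothetical binary levels for the conceptual variables named in Figure 2.
# The real benchmark may use different or additional levels.
FACTORS = {
    "personal_force": [False, True],
    "intention_of_harm": [False, True],
    "self_benefit": [False, True],
}


def full_factorial(factors):
    """Return one configuration per combination of factor levels.

    Crossing every level of every factor keeps the factors orthogonal:
    each level of one factor co-occurs equally often with each level
    of every other factor.
    """
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


configs = full_factorial(FACTORS)
```

Because every combination appears exactly once, the effect of one variable on a model's decision can be estimated by averaging over the others without confounding.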

Paper Structure (61 sections, 15 figures, 10 tables)

Introduction
Related Work
Theoretical Foundations of Morality
Moral Evaluation Benchmarks
Investigating Morality in
The MDS
Generation Pipeline
Dataset Construction
Quantity
Single Feature
Interaction
Semantic Validation of Visual Contexts
The Diagnostic Evaluation Protocol
Text Mode
Caption Mode
...and 46 more sections

Figures (15)

Figure 1: Visual modality distracts moral decision-making in VLMs. Compared to text-only scenarios, visual inputs cause models to (a) lose sensitivity to numerical stakes in utilitarian trade-offs, responding indiscriminately regardless of lives saved; (b) prioritize self-interest over loyalty to friends; and (c) collapse hierarchical social values, treating demographically distinct groups as equivalent. Together, these failures reveal (d) a fundamental vulnerability introduced by visual distraction: visual inputs bypass language-level safety filters, producing misaligned outputs that text-based alignment cannot prevent.
Figure 2: The MDS generation pipeline. Grounded in MFT, each dilemma is framed as a moral conflict either within a single dimension or across two dimensions (green block). A controllable generation engine then orthogonally manipulates conceptual variables (personal force, intention of harm, self-benefit) and character variables (species, race, profession, age) to configure the dilemma (orange block). The resulting configuration populates a description template, which is rewritten by GPT for fluency, while visual scene elements are randomly sampled for diversity. Each generated sample (blue block) comprises a rendered image embedding both the visual scene and the dilemma description, paired with a structured configuration file that records the ground truth of all controlled variables.
Figure 3: Semantic validation of visual contexts. t-SNE projection of word embeddings (dots) from Gemini-generated image captions shows distinct clustering by MFT dimensions (stars). Words characteristic of each dimension form well-separated semantic clusters; for instance, Authority terms (e.g., law, duty) and Purity terms (e.g., hygiene, unhygienic) are clearly distinct from Care and Fairness. This confirms that the generated visual scenarios preserve the intended moral distinctions.
Figure 4: The tri-modal evaluation protocol. Three evaluation modes are applied to the same underlying dilemma: Text Mode (top) presents the ground-truth structured description; Caption Mode (middle) requires the model to first generate a visual caption and extract the embedded text via OCR, then reason from these outputs; Image Mode (bottom) provides the rendered image directly. This design decomposes the overall modality gap into a context gap (Text vs. Caption Mode, attributable to informational complexity) and a modality gap (Caption vs. Image Mode, attributable to visual processing itself).
Figure 5: Action probability curves across utilitarian ratios. The x-axis shows the ratio of lives saved to lives sacrificed, and the y-axis indicates action probability. In Text and Caption Modes, most models exhibit rational S-shaped curves whose slope reflects sensitivity to quantitative stakes. In Image Mode, these curves frequently flatten, indicating that visual input decouples decisions from utilitarian reasoning. LLaVA-v1.6-34B represents the most extreme case, with action probability collapsing to near 1.0 in Image Mode regardless of ratio. Best viewed as vector graphics; zoom in for details.
...and 10 more figures
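
The gap decomposition described in the Figure 4 caption can be illustrated with a short sketch. It assumes each mode is summarized by a single scalar score (for example, the action probability on a fixed set of dilemmas) and that gaps are simple differences; the `ModeScores` type, the `decompose_gap` function, and the difference-based metric are assumptions made for illustration, since the paper's exact metric is not stated in this excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeScores:
    """Hypothetical per-mode summary score for one model (e.g. action probability)."""
    text: float
    caption: float
    image: float


def decompose_gap(scores):
    """Split the Text-to-Image difference into two additive parts.

    context_gap:  Caption minus Text (attributed to informational complexity)
    modality_gap: Image minus Caption (attributed to visual processing itself)
    overall_gap:  Image minus Text, which equals the sum of the two parts
    """
    context_gap = scores.caption - scores.text
    modality_gap = scores.image - scores.caption
    return {
        "context_gap": context_gap,
        "modality_gap": modality_gap,
        "overall_gap": scores.image - scores.text,
    }


# Illustrative numbers only; these are not results from the paper.
example = decompose_gap(ModeScores(text=0.40, caption=0.50, image=0.90))
```

Under this assumption the two parts always sum to the overall Text-to-Image difference, which is what lets Caption Mode act as a control separating informational complexity from visual processing.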
