Moral Sycophancy in Vision Language Models

Shadman Rabby, Md. Hefzul Hossain Papon, Sabbir Ahmed, Nokimul Hasan Arif, A. B. M. Ashikur Rahman, Irfan Ahmad

TL;DR

This study systematically evaluates moral sycophancy in vision-language models under explicit user disagreement using two benchmarks, Moralise and M$^3$oralBench, across ten diverse models. A two-turn prompting protocol reveals an asymmetric tendency for models to shift from morally right to morally wrong judgments when challenged, with the Error Introduction Rate (EIR) and Error Correction Rate (ECR) exposing a trade-off between stable reasoning and adaptability. Results show open-source models are more susceptible to moral drift than proprietary systems, and dataset characteristics strongly influence robustness and topic-specific vulnerabilities. The work highlights normative instability in multimodal ethical reasoning and motivates the development of mitigation strategies to enhance moral consistency in VLMs for safer, more trustworthy AI assistants.

Abstract

Sycophancy in Vision-Language Models (VLMs) refers to their tendency to align with user opinions, often at the expense of moral or factual accuracy. While prior studies have explored sycophantic behavior in general contexts, its impact on morally grounded visual decision-making remains insufficiently understood. To address this gap, we present the first systematic study of moral sycophancy in VLMs, analyzing ten widely used models on the Moralise and M$^3$oralBench datasets under explicit user disagreement. Our results reveal that VLMs frequently produce morally incorrect follow-up responses even when their initial judgments are correct, and exhibit a consistent asymmetry: models are more likely to shift from morally right to morally wrong judgments than the reverse when exposed to user-induced bias. Follow-up prompts generally degrade performance on Moralise, while yielding mixed or even improved accuracy on M$^3$oralBench, highlighting dataset-dependent differences in moral robustness. Evaluation using Error Introduction Rate (EIR) and Error Correction Rate (ECR) reveals a clear trade-off: models with stronger error-correction capabilities tend to introduce more reasoning errors, whereas more conservative models minimize errors but exhibit limited ability to self-correct. Finally, initial contexts with a morally right stance elicit stronger sycophantic behavior, emphasizing the vulnerability of VLMs to moral influence and the need for principled strategies to improve ethical consistency and robustness in multimodal AI systems.
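
The abstract names EIR and ECR without defining them here. As a minimal sketch, assume the common reading: EIR is the share of initially correct answers that become incorrect after the disagreement prompt, and ECR is the share of initially incorrect answers that become correct. These definitions, the function name `eir_ecr`, and the toy labels below are illustrative assumptions and may differ from the paper's own equations.

```python
from typing import Sequence, Tuple


def eir_ecr(
    gold: Sequence[str],
    first: Sequence[str],
    second: Sequence[str],
) -> Tuple[float, float]:
    """Return (EIR, ECR) under the assumed definitions.

    gold   : reference labels (e.g. "A" or "B")
    first  : model answers in round 1
    second : model answers in round 2, after user disagreement
    """
    if not (len(gold) == len(first) == len(second)):
        raise ValueError("all three sequences must have the same length")

    n_right = 0      # initially correct answers
    n_wrong = 0      # initially incorrect answers
    introduced = 0   # correct -> incorrect flips
    corrected = 0    # incorrect -> correct flips

    for g, f, s in zip(gold, first, second):
        if f == g:
            n_right += 1
            if s != g:
                introduced += 1
        else:
            n_wrong += 1
            if s == g:
                corrected += 1

    # Assumed convention: a rate with an empty denominator is reported as 0.0.
    eir = introduced / n_right if n_right else 0.0
    ecr = corrected / n_wrong if n_wrong else 0.0
    return eir, ecr


# Toy data: 3 items start correct (1 flips to wrong),
# 2 items start wrong (1 flips to correct).
gold = ["A", "B", "A", "B", "A"]
first = ["A", "B", "B", "B", "B"]
second = ["B", "B", "A", "B", "B"]
eir, ecr = eir_ecr(gold, first, second)
print(f"EIR={eir:.3f}, ECR={ecr:.3f}")
```

Under this reading, a model that flips many answers after disagreement can score high on both rates at once, which is one way the trade-off described in the abstract could arise.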

Paper Structure (22 sections, 3 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Illustration of 'moral sycophancy', where a VLM initially judges the depicted behavior as 'not morally wrong'. After user disagreement, it revises its conclusion to 'morally wrong' without new evidence, demonstrating a user-induced shift in moral judgment.
  • Figure 2: Primary vs. follow-up accuracy of VLMs under user disagreement. (a) Moralise: accuracy typically drops after follow-up prompts. (b) M$^3$oralBench: the trend reverses for several models, with follow-up prompts sometimes improving accuracy.
  • Figure 3: Comparison of Error Introduction Rate (EIR) and Error Correction Rate (ECR) across VLMs under moral evaluation. (a) Results on the Moralise dataset show high variability with no consistent relationship between error introduction and correction behavior. (b) Results on the M$^3$oralBench dataset exhibit substantial randomness across models, similarly indicating the absence of a systematic performance trend.
  • Figure 4: EIR–ECR trade-off on the Moralise and M$^3$oralBench datasets. The figure shows that high task accuracy does not guarantee strong self-correction: models like GPT-4o and Gemini-2.5-Pro recover poorly from induced errors, while mid-sized open-source models (e.g., Qwen-VL-Max, InternVL2.5-8B) achieve a more favorable balance of lower EIR and higher ECR. Models with low initial accuracy (e.g., Qwen2-VL-2B, Gemini-2.5-Flash-Lite) exhibit high ECR, though often due to unstable output shifts rather than genuine correction. Overall, robustness to induced reasoning errors appears largely orthogonal to baseline accuracy.
  • Figure 5: Two-turn prompting setup. Round 1: image $x$ + choice query $\rightarrow$ model outputs A/B. Round 2: image $x$ + conversation history + disagreement prompt $\rightarrow$ model outputs A/B with a one-line justification. Attachments indicated at the bottom of each panel mirror what is provided to the model in that round.
  • ...and 1 more figures