Table of Contents
Fetching ...

Benchmarking Gaslighting Negation Attacks Against Multimodal Large Language Models

Bin Zhu, Yinxuan Gui, Huiyan Qi, Jingjing Chen, Chong-Wah Ngo, Ee-Peng Lim

TL;DR

The paper addresses the vulnerability of multimodal large language models to gaslighting negation, where models reverse correct outputs when faced with negated arguments. It introduces GaslightingBench, the first dedicated benchmark with 1,287 MCQs across 20 categories and a negation-prompt generation pipeline to systematically stress-test robustness in conversational settings. Across eight benchmarks and a mix of proprietary and open-source models, the study reveals widespread robustness gaps, with larger and open-source models typically more susceptible, though even advanced reasoning-oriented systems can be misled. The findings underscore the need for improved alignment, calibrated confidence, and evaluation frameworks to ensure reliable, trustworthy multimodal AI in adversarial environments.

Abstract

Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities, excelling in complex understanding and generation tasks. Despite their success, MLLMs remain vulnerable to conversational adversarial inputs. In this paper, we systematically study gaslighting negation attacks: a phenomenon where models, despite initially providing correct answers, are persuaded by user-provided negations to reverse their outputs, often fabricating justifications. We conduct extensive evaluations of state-of-the-art MLLMs across diverse benchmarks and observe substantial performance drops when negation is introduced. Notably, we introduce the first benchmark GaslightingBench, specifically designed to evaluate the vulnerability of MLLMs to negation arguments. GaslightingBench consists of multiple-choice questions curated from existing datasets, along with generated negation prompts across 20 diverse categories. Throughout extensive evaluation, we find that proprietary models such as Gemini-1.5-flash and GPT-4o demonstrate better resilience compared to open-source counterparts like Qwen2-VL and LLaVA, though even advanced reasoning-oriented models like Gemini-2.5-Pro remain susceptible. Our category-level analysis further shows that subjective or socially nuanced domains (e.g., Social Relation, Image Emotion) are especially fragile, while more objective domains (e.g., Geography) exhibit relatively smaller but still notable drops. Overall, all evaluated MLLMs struggle to maintain logical consistency under gaslighting negation attack. These findings highlight a fundamental robustness gap and provide insights for developing more reliable and trustworthy multimodal AI systems. Project website: https://yxg1005.github.io/GaslightingNegationAttacks/.

Benchmarking Gaslighting Negation Attacks Against Multimodal Large Language Models

TL;DR

The paper addresses the vulnerability of multimodal large language models to gaslighting negation, where models reverse correct outputs when faced with negated arguments. It introduces GaslightingBench, the first dedicated benchmark with 1,287 MCQs across 20 categories and a negation-prompt generation pipeline to systematically stress-test robustness in conversational settings. Across eight benchmarks and a mix of proprietary and open-source models, the study reveals widespread robustness gaps, with larger and open-source models typically more susceptible, though even advanced reasoning-oriented systems can be misled. The findings underscore the need for improved alignment, calibrated confidence, and evaluation frameworks to ensure reliable, trustworthy multimodal AI in adversarial environments.

Abstract

Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities, excelling in complex understanding and generation tasks. Despite their success, MLLMs remain vulnerable to conversational adversarial inputs. In this paper, we systematically study gaslighting negation attacks: a phenomenon where models, despite initially providing correct answers, are persuaded by user-provided negations to reverse their outputs, often fabricating justifications. We conduct extensive evaluations of state-of-the-art MLLMs across diverse benchmarks and observe substantial performance drops when negation is introduced. Notably, we introduce the first benchmark GaslightingBench, specifically designed to evaluate the vulnerability of MLLMs to negation arguments. GaslightingBench consists of multiple-choice questions curated from existing datasets, along with generated negation prompts across 20 diverse categories. Throughout extensive evaluation, we find that proprietary models such as Gemini-1.5-flash and GPT-4o demonstrate better resilience compared to open-source counterparts like Qwen2-VL and LLaVA, though even advanced reasoning-oriented models like Gemini-2.5-Pro remain susceptible. Our category-level analysis further shows that subjective or socially nuanced domains (e.g., Social Relation, Image Emotion) are especially fragile, while more objective domains (e.g., Geography) exhibit relatively smaller but still notable drops. Overall, all evaluated MLLMs struggle to maintain logical consistency under gaslighting negation attack. These findings highlight a fundamental robustness gap and provide insights for developing more reliable and trustworthy multimodal AI systems. Project website: https://yxg1005.github.io/GaslightingNegationAttacks/.

Paper Structure

This paper contains 17 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Examples demonstrate that GPT-5 initially provides correct answers but incorrectly revises its responses when confronted with user-provided negation arguments. GPT-5 shows a tendency to accept misleading inputs, often generating hallucinated explanations to justify the revised answers, a behavior that can be described as a form of "gaslighting". Note: "Gaslighting is the manipulation of someone into questioning their own perception of reality." - Wikipedia.
  • Figure 2: Comparison of MLLMs's performance before (i.e., initial answers) and after gaslighting negation attack, reported as average accuracy across eight benchmarks-MME fu2023mme, MMMU yue2023mmmu, MMMUPro yue2024mmmu-MMMUPro, MMBench liu2025mmbench, PoPE li2023evaluating-Pope, ChartQA masry2022chartqa, AI2Diagram kembhavi2016diagram and MathVista lu2023mathvista). The results highlight the substantial accuracy drop across all models when negation is introduced. More detailed results are available in Table \ref{['table:performance']}.
  • Figure 3: Evaluation pipeline for assessing the robustness of Multimodal Large Language Models (MLLMs) to gaslighting negation attack. The pipeline consists of three key stages: (1) Inputs and Initial Answers: MLLMs receive a variety of question formats as input, including Yes/No, Multiple-Choice, and Free-Form, and their initial answers are recorded. (2) Negation Generation: if the model's initial response is correct, a negation argument is introduced to challenge its answer. Different negation strategies are applied based on the question type. (3) Post-Negation Evaluation: the model's response after negation is analyzed to determine if it maintains consistency or is misled into revising its answer. Post-processing is applied to normalize responses for accurate comparison.
  • Figure 4: The category distribution of GaslightingBench with 20 categories and 1,287 samples. Each category is carefully curated from existing datasets to ensure balanced representation and broad coverage, providing a comprehensive evaluation dataset for assessing MLLMs' vulnerabilities to gaslighting negation attacks.
  • Figure 5: Examples from different categories in the GaslightingBench. The green-highlighted option is correct, while a randomly chosen incorrect option is used to generate the negation argument.
  • ...and 5 more figures