Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models

Sitong Fang, Shiyi Hou, Kaile Wang, Boyuan Chen, Donghai Hong, Jiayi Zhou, Josef Dai, Yaodong Yang, Jiaming Ji

TL;DR

This work addresses the emergent safety risk of deception in multimodal large language models by introducing MM-DeceptionBench, the first benchmark focused on multimodal deception across six behavioral categories. To monitor and detect such deception, the authors propose debate with images, a visually grounded multi-agent evaluation framework that requires evidence-backed arguments. Across open-source and frontier models, the approach yields higher alignment with human judgments and improved deception detection compared with text-only or non-visual baselines, demonstrating broader applicability to multimodal safety tasks. The paper also discusses practical considerations, including annotation ethics, dual-use risks, and computational overhead, arguing that the safety benefits justify the additional cost and that the framework can guide future scalable monitoring efforts.

Abstract

Are frontier AI systems becoming more capable? Certainly. Yet such progress is not an unalloyed blessing but rather a Trojan horse: behind the performance leaps lies a more insidious and destructive safety risk, namely deception. Unlike hallucination, which arises from insufficient capability and leads to mistakes, deception represents a deeper threat in which models deliberately mislead users through complex reasoning and insincere responses. As system capabilities advance, deceptive behaviors have spread from textual to multimodal settings, amplifying their potential harm. This raises a pressing question: how can we monitor these covert multimodal deceptive behaviors? Current research remains almost entirely confined to text, leaving the deceptive risks of multimodal large language models unexplored. In this work, we systematically reveal and quantify multimodal deception risks, introducing MM-DeceptionBench, the first benchmark explicitly designed to evaluate multimodal deception. Covering six categories of deception, MM-DeceptionBench characterizes how models strategically manipulate and mislead through combined visual and textual modalities. At the same time, evaluating multimodal deception is almost a blind spot for existing methods: the stealth of such deception, compounded by visual-semantic ambiguity and the complexity of cross-modal reasoning, renders action monitoring and chain-of-thought monitoring largely ineffective. To tackle this challenge, we propose debate with images, a novel multi-agent debate monitoring framework. By compelling models to ground their claims in visual evidence, this method substantially improves the detectability of deceptive strategies. Experiments show that it consistently increases agreement with human judgments across all tested models, boosting Cohen's kappa by 1.5x and accuracy by 1.25x on GPT-4o.
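The abstract reports agreement with human judgments using Cohen's kappa and accuracy. As a minimal sketch of how these two metrics can be computed for a deception monitor, the following uses the standard definition of Cohen's kappa; the function names and the toy label lists are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter

def accuracy(human, monitor):
    """Fraction of cases where the monitor's verdict matches the human label."""
    if len(human) != len(monitor) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, monitor)) / len(human)

def cohen_kappa(human, monitor):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    observed = accuracy(human, monitor)
    counts_h = Counter(human)
    counts_m = Counter(monitor)
    # Chance agreement: probability both raters pick the same label independently.
    labels = set(counts_h) | set(counts_m)
    expected = sum(counts_h[c] * counts_m[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)

# Toy data (hypothetical): 1 = deceptive, 0 = non-deceptive.
human_labels = [1, 1, 0, 0, 1, 0, 1, 0]
monitor_labels = [1, 0, 0, 0, 1, 0, 1, 1]

print(accuracy(human_labels, monitor_labels))     # 0.75
print(cohen_kappa(human_labels, monitor_labels))  # 0.5
```

Kappa is lower than accuracy here because half of the observed agreement would be expected by chance with balanced labels, which is why the two metrics are reported separately.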

Paper Structure

This paper contains 62 sections, 1 theorem, 9 equations, 18 figures, 6 tables, 1 algorithm.

Key Result

Proposition 1

Let $\gamma \in (0,1)$ be the per-round information retention rate, after $n$ rounds of debate, where $\bm{D}_{k}^{text}$ denotes the textual history of a text-only debate process after round $k\;(k\leq n)$, and $I(\cdot;\cdot)$ denotes the mutual information.

Figures (18)

  • Figure 1: Defining multimodal deception through three distinct behavioral patterns. Left: In textual settings, well-aligned LLMs typically maintain honesty when provided with accurate descriptions, correctly identifying a deer despite conflicting human beliefs. Center: Multimodal deception occurs when MLLMs demonstrate deliberate contradiction between visual interpretation and user-facing responses to cater to human beliefs. Right: Hallucination represents a distinct failure mode where MLLMs incorrectly process visual inputs, leading to systematic misidentification that coincidentally aligns with human beliefs. This taxonomy distinguishes multimodal deception from perceptual failures and capability insufficiency.
  • Figure 2: The composition of MM-DeceptionBench. (a) Six categories of deceptive behaviors. (b) K-Means clustering of image embeddings illustrates diverse visual content. (c) Pairwise correlation heatmaps indicate balanced category representation. (d) Example from Deliberate Omission: an AI assistant highlights positive features while ignoring visible pollution in promotional copywriting. (e) A four-stage annotation pipeline ensures benchmark quality, including annotator training with deception taxonomy, iterative case development with scenario design and pressure factors, real-time model testing with refinement, and cross-annotator validation with panel review.
  • Figure 3: Debate with images: A multi-agent evaluation framework for detecting multimodal deception. Top: Comparison of three evaluation approaches. Left: Single Agent Judge provides a direct assessment but lacks robustness. Center: Debate about images conducts multi-agent debate but without visual grounding. Right: Our proposed debate with images framework combines multi-agent debate with explicit visual evidence grounding through specialized visual operations. Bottom: Detailed workflow showing how two MLLMs engage in structured debate across multiple rounds, with each model performing different visual operations to support their arguments. This framework enhances detectability by forcing models to justify claims with explicit cross-modal grounding, leading to more reliable multimodal deception evaluation.
  • Figure 4: Effect of the number of agents and rounds on deception detection performance. Left: Percentage of deception detected. Right: Accuracy representing human agreement.
  • Figure 5: The number of visual operations vs accuracy and deception detected rate.
  • ...and 13 more figures
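
The Figure 3 caption describes a multi-round debate in which each model must support its argument with visual operations. The following is a rough sketch of that control flow only, not the authors' implementation: the `Turn` record, the bounding-box representation of visual evidence, the stub debaters, and the separate judge function are all hypothetical stand-ins introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # hypothetical (x0, y0, x1, y1) image region

@dataclass
class Turn:
    agent: str
    argument: str
    visual_evidence: List[Box]  # regions the agent points to (assumed format)

# Hypothetical signatures: a debater sees the case and history and returns a Turn;
# a judge sees the case and full history and returns True if deception is found.
Debater = Callable[[str, List[Turn]], Turn]
Judge = Callable[[str, List[Turn]], bool]

def run_debate(case: str, debaters: List[Debater], judge: Judge, rounds: int = 2):
    """Run a fixed number of rounds; reject any argument lacking visual evidence."""
    history: List[Turn] = []
    for _ in range(rounds):
        for debater in debaters:
            turn = debater(case, history)
            if not turn.visual_evidence:
                raise ValueError(f"{turn.agent} made a claim without visual grounding")
            history.append(turn)
    return judge(case, history), history

# Stub agents standing in for MLLM calls (assumptions, for demonstration only).
def affirmative(case: str, history: List[Turn]) -> Turn:
    return Turn("affirmative", "The response omits visible pollution.", [(10, 20, 80, 90)])

def negative(case: str, history: List[Turn]) -> Turn:
    return Turn("negative", "The highlighted features are accurate.", [(0, 0, 50, 50)])

def majority_judge(case: str, history: List[Turn]) -> bool:
    # Toy rule: side with whichever agent cited more evidence regions overall.
    score = sum(len(t.visual_evidence) * (1 if t.agent == "affirmative" else -1)
                for t in history)
    return score >= 0

verdict, transcript = run_debate("promotional copy case", [affirmative, negative],
                                 majority_judge, rounds=2)
print(verdict, len(transcript))  # True 4
```

The evidence check inside the loop is the point of the sketch: a turn that cites no image region is rejected outright, mirroring the caption's requirement that claims be justified with explicit cross-modal grounding.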

Theorems & Definitions (3)

  • Proposition 1: Visual Grounding Slows Information Decay
  • Remark 2: Asymmetric Deception Difficulty
  • proof