The Mirage of Multimodality: Where Truth is Tested and Honesty Unravels

Jiaming Ji, Sitong Fang, Wenjing Cao, Jiahao Li, Xuyao Wang, Juntao Dai, Chi-Min Chan, Sirui Han, Yike Guo, Yaodong Yang

TL;DR

The study shows that slower, depth-first multimodal reasoning can produce plausible but false details when visual inputs are ambiguous, a phenomenon the authors term the Mirage of Multimodality. To study it, the paper introduces Truthfulvqa, a 5,000-image benchmark with hierarchical prompts and human-in-the-loop validation that assesses honesty across increasing reasoning depth and eight deception categories. It also presents TruthfulJudge, a specialized judge model trained on human critiques to evaluate model outputs, which shows stronger calibration and closer alignment with human judgments. Empirical results show that chat models generally outperform reasoning-augmented ones in truthfulness, and that accuracy declines markedly as prompts become more deceptive, underscoring the need for better honesty alignment in multimodal systems. The work provides a scalable evaluation framework and highlights both the promise and the limitations of automated judging in complex truthfulness tasks.

Abstract

Reasoning models have recently attracted significant attention, especially for tasks that involve complex inference. Their strengths exemplify the System II paradigm (slow, structured thinking), in contrast to System I (rapid, heuristic-driven thinking). Yet does slower reasoning necessarily lead to greater truthfulness? Our findings suggest otherwise. In this study, we present the first systematic investigation of distortions associated with System I and System II reasoning in multimodal contexts. We demonstrate that slower reasoning models, when presented with incomplete or misleading visual inputs, are more likely to fabricate plausible yet false details to support flawed reasoning -- a phenomenon we term the "Mirage of Multimodality". To examine this, we constructed a 5,000-sample hierarchical prompt dataset annotated by 50 human participants. These prompts gradually increase in complexity, revealing a consistent pattern: slower reasoning models tend to employ depth-first thinking (delving deeper into incorrect premises), whereas faster chat models favor breadth-first inference, exhibiting greater caution under uncertainty. Our results highlight a critical vulnerability of slower reasoning models: although highly effective in structured domains such as mathematics, they become brittle when confronted with ambiguous multimodal inputs.

Paper Structure

This paper contains 28 sections, 3 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Mean–variance (inverse-scale) landscape of MLLMs on the benchmark. Each data point represents a mainstream model, with marker size scaled to the number of parameters (ranging from 2B to 72B); API-based systems are included for completeness. Chat models follow a clear scaling-law trajectory, whereas reasoning models depart from this trend and exhibit inferior performance.
  • Figure 2: Overview of the hierarchical Truthfulvqa framework. More than 50 human annotators contributed to the construction of the dataset. Images collected from online sources are paired with hierarchically structured, human-written questions designed to probe different forms of hallucination. Model responses are systematically analyzed across varying levels of truthfulness, offering a comprehensive view of honesty tendencies in MLLMs.
  • Figure 3: Overview and pipeline of the hierarchical Truthfulvqa framework. 50 human annotators contributed to the construction of the dataset. Images collected from online sources are paired with hierarchically structured, human-written questions designed to probe different forms of hallucination.
  • Figure 4: Multi-perspective evaluation of 50+ models on the Truthfulvqa benchmark. (a): Two-dimensional scatter plot of ECE vs. CAI ($\lambda = 1$); chat models outperform their reasoning variants. (b): Density map of mean vs. variance across the three levels. (c): Violin-box plots of accuracy at each of the three levels, illustrating the dispersion at each level. (d): Heat map of scores across the eight categories (S1–S8); models are ordered by overall performance, exposing per-category strengths and weaknesses. (e): Three-level stacked bar charts of chat and reasoning models. Collectively, the five panels show that chat models consistently outperform reasoning models in multimodal truthfulness challenges.
  • Figure 5: Win rate. Comparison of GPT-4o, Llama4-Maverick and Qwen2.5-VL-72B, as evaluated by Gemini-1.5, TruthfulJudge, and human annotators.
  • ...and 4 more figures
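
Figure 4(a) plots models by ECE. The listing here does not define the metric, so the following is only a minimal sketch that assumes ECE means the standard expected calibration error with equal-width confidence bins. The function name, the bin count, and the input format are illustrative assumptions, not the paper's implementation.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count; the paper's setting is not given here
) -> float:
    """Standard ECE: sample-weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins  # summed confidence per bin
    hits = [0] * n_bins        # number of correct answers per bin
    count = [0] * n_bins       # number of predictions per bin

    for c, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hits[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for k in range(n_bins):
        if count[k] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[k] / count[k]
        accuracy = hits[k] / count[k]
        ece += (count[k] / n) * abs(accuracy - avg_conf)
    return ece
```

Under this definition, a model that answers with 0.9 confidence but is right only three times out of four is overconfident by 0.15 in that bin; lower values indicate better calibration.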