The Mirage of Multimodality: Where Truth is Tested and Honesty Unravels
Jiaming Ji, Sitong Fang, Wenjing Cao, Jiahao Li, Xuyao Wang, Juntao Dai, Chi-Min Chan, Sirui Han, Yike Guo, Yaodong Yang
TL;DR
The study shows that slower, depth-first multimodal reasoning can produce plausible but false details when visual inputs are ambiguous, a phenomenon the authors call the Mirage of Multimodality. To study it, the work introduces TruthfulVQA, a 5,000-image benchmark with hierarchical prompts and rigorous human-in-the-loop validation, which assesses honesty across increasing reasoning depth and eight deception categories. It also presents TruthfulJudge, a specialized judge model trained on human critiques to evaluate model outputs reliably, demonstrating superior calibration and alignment with human judgments. Empirical results show that chat models generally outperform reasoning-augmented ones in truthfulness, and that accuracy declines notably as prompts become more deceptive, underscoring the need for better honesty alignment in multimodal systems. The work provides a scalable evaluation framework and highlights both the promise and the limitations of automated judging in complex truthfulness tasks.
Abstract
Reasoning models have recently attracted significant attention, especially for tasks that involve complex inference. Their strengths exemplify the System II paradigm (slow, structured thinking), in contrast to System I (rapid, heuristic-driven thinking). Yet does slower reasoning necessarily lead to greater truthfulness? Our findings suggest otherwise. In this study, we present the first systematic investigation of distortions associated with System I and System II reasoning in multimodal contexts. We demonstrate that slower reasoning models, when presented with incomplete or misleading visual inputs, are more likely to fabricate plausible yet false details to support flawed reasoning, a phenomenon we term the "Mirage of Multimodality". To examine this, we constructed a 5,000-sample hierarchical prompt dataset annotated by 50 human participants. The prompts gradually increase in complexity and reveal a consistent pattern: slower reasoning models tend to employ depth-first thinking (delving deeper into incorrect premises), whereas faster chat models favor breadth-first inference and exhibit greater caution under uncertainty. Our results highlight a critical vulnerability of slower reasoning models: although highly effective in structured domains such as mathematics, they become brittle when confronted with ambiguous multimodal inputs.
