More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, Sheng Liu

TL;DR

The paper investigates how extended reasoning in multimodal large language models can degrade visual grounding, leading to hallucinations. It introduces RH-AUC as an area-under-the-curve metric and RH-Bench as a diagnostic dataset to quantify the trade-off between reasoning prowess and perceptual fidelity across varying reasoning lengths. Key findings show that larger models often balance reasoning and perception better, while training data type and domain exert more influence than sheer data volume; RL-only training generally yields a more adaptive balance than SFT+RL. The work emphasizes the need for evaluation frameworks that jointly consider reasoning depth and perceptual grounding to steer progress in multimodal reasoning systems.

Abstract

Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.
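
As a rough illustration of the area-under-the-curve idea behind RH-AUC, the sketch below integrates perception accuracy over normalized reasoning length using the trapezoidal rule. It is a minimal sketch under stated assumptions, not the paper's definition: this section does not give the exact formula, so the function name `perception_auc`, the choice of axes, and the min-max normalization of reasoning length are all hypothetical choices made for illustration.

```python
from typing import Sequence


def perception_auc(lengths: Sequence[float], accuracies: Sequence[float]) -> float:
    """Trapezoidal area under perception accuracy vs. normalized reasoning length.

    Hypothetical helper: the axes and normalization are assumptions,
    not the metric definition used in the paper.
    """
    if len(lengths) != len(accuracies):
        raise ValueError("lengths and accuracies must have the same size")
    if len(lengths) < 2:
        raise ValueError("need at least two (length, accuracy) points")

    # Sort the measurements by reasoning length.
    points = sorted(zip(lengths, accuracies))
    lo, hi = points[0][0], points[-1][0]
    if hi == lo:
        raise ValueError("reasoning lengths must not all be equal")

    # Min-max normalize lengths to [0, 1] so that a model holding
    # accuracy 1.0 at every length scores exactly 1.0.
    xs = [(length - lo) / (hi - lo) for length, _ in points]
    ys = [acc for _, acc in points]

    # Trapezoidal rule over consecutive points.
    area = 0.0
    for i in range(len(xs) - 1):
        area += (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
    return area


# Made-up example: accuracy drops as the reasoning chain gets longer.
score = perception_auc([100, 200, 400], [0.8, 0.7, 0.5])
```

Under these assumptions, a score near 1 would indicate a model whose perception accuracy stays high as reasoning grows longer, while a lower score would indicate that accuracy decays with length.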

Paper Structure

This paper contains 16 sections, 3 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: (a) Example outputs from a reasoning model and a non-reasoning model on a perception task. Red highlights indicate visual hallucination. Multimodal reasoning models are generally more prone than their non-reasoning counterparts to amplifying hallucinations during the reasoning process. (b) Performance of different models on reasoning and perception tasks in the RH-Bench dataset. Better-performing models are positioned in the upper right corner. Baseline non-reasoning models of varying scales typically exhibit weaker reasoning capabilities and fewer hallucinations, whereas reasoning models display the opposite trend.
  • Figure 2: Comparison of reasoning and non-reasoning models on five perception benchmarks. Results are shown for 3B models (left) and 7B models (right). Higher scores indicate lower hallucination.
  • Figure 3: Performance across four perception benchmarks comparing Base, RL, and SFT+RL.
  • Figure 4: Two common types of hallucination patterns observed in multimodal reasoning models. (a) corresponds to hallucinations caused by visual misrecognition, while (b) reflects hallucinations arising from reasoning biases. Hallucinated spans are highlighted in red.
  • Figure 5: Attention allocation and visual grounding in reasoning and non-reasoning models. The reduction of visual attention in reasoning models amplifies visual hallucinations.
  • ...and 3 more figures