Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models

Ji Ma, Wei Suo, Peng Wang, Yanning Zhang

Abstract

Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capabilities in complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems due to diminished visual attention during generation. However, visual attention decay is already a well-studied problem in Large Vision-Language Models (LVLMs). Given the fundamental differences in reasoning processes between MCoT models and traditional LVLMs, we raise a basic question: do MCoT models have unique causes of hallucinations? To answer it, we systematically investigate the hallucination patterns of MCoT models and find that fabricated text is primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that localizes divergent thinking steps and intervenes in the decoding process to mitigate hallucinations. Extensive experiments show that our method outperforms existing methods by a large margin. More importantly, it can be conveniently integrated with other hallucination mitigation methods to further boost their performance. The code is publicly available at https://github.com/ASGO-MM/MCoT-hallucination.

Paper Structure

This paper contains 23 sections, 3 equations, 9 figures, 5 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Visualization of different reasoning paradigms. (a) Traditional LVLMs follow an implicit reasoning paradigm. (b) MCoT models reason explicitly, and their answers are influenced by the thinking process.
  • Figure 2: (a) Co-occurrence relationship of hallucinations in the MCoT model's thinking and answering. We find that hallucinations in thinking and answering are positively correlated. (b) Further attention visualization reveals that this phenomenon is caused by the MCoT model's attention bias towards its thinking process when generating final answers.
  • Figure 3: Hallucination patterns in the MCoT model's thinking process. We find that models are prone to hallucinations during associative reasoning steps, which we term divergent thinking. In this thinking mode, MCoT models exhibit approximately 5 times more hallucinations than in normal thinking.
  • Figure 4: (a) Visual entropy across different thinking modes. The results show that models exhibit higher values when engaging in divergent thinking. (b) Predicting the divergent thinking mode from visual entropy. The logistic curve demonstrates that high visual entropy reliably predicts divergent thinking steps (see the sketch after this list).
  • Figure 5: Visualization of visual entropy. We randomly choose ten image tokens for clarity, and dark colors denote low entropy values. (a) Compared to normal thinking, entropy values are significantly higher in divergent thinking steps. (b) After applying our method, visual entropy shows an evident decline.
  • ...and 4 more figures
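
The notion of visual entropy referenced in Figures 4 and 5 can be illustrated with a minimal sketch. This assumes visual entropy is computed per decoding step as the Shannon entropy of the attention mass that step places on image tokens (renormalized over those tokens); the function names `visual_entropy` and `flag_divergent_steps` and the fixed threshold are illustrative assumptions, since the paper fits a logistic curve over visual entropy rather than applying a hard cutoff.

```python
import numpy as np

def visual_entropy(attn_weights, image_token_mask, eps=1e-12):
    """Entropy of the (renormalized) attention mass placed on image tokens.

    attn_weights: 1-D array of attention from the current decoding step
        over all context tokens.
    image_token_mask: boolean array marking which positions are image tokens.
    High entropy means the step attends diffusely to the image.
    """
    visual_attn = attn_weights[image_token_mask]
    p = visual_attn / (visual_attn.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())


def flag_divergent_steps(step_entropies, threshold):
    """Flag decoding steps whose visual entropy exceeds a threshold.

    The paper predicts divergent thinking with a logistic curve over visual
    entropy; a fixed threshold is used here purely for illustration.
    """
    return [h > threshold for h in step_entropies]


# Toy usage: 10 image tokens among 16 context positions, two decoding steps.
rng = np.random.default_rng(0)
mask = np.zeros(16, dtype=bool)
mask[:10] = True

focused = rng.dirichlet(np.full(16, 0.1))  # peaky attention -> tends to low entropy
diffuse = rng.dirichlet(np.full(16, 5.0))  # flat attention  -> tends to high entropy
entropies = [visual_entropy(a, mask) for a in (focused, diffuse)]
print(entropies, flag_divergent_steps(entropies, threshold=1.8))
```

Under this formulation, diffuse attention over image tokens yields high entropy, matching the pattern Figure 4 associates with divergent thinking steps.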