MINT: Mitigating Hallucinations in Large Vision-Language Models via Token Reduction

Chao Wang, Jianming Yang, Yang Zhou

TL;DR

The paper addresses hallucinations in large vision-language models (LVLMs) by analyzing attention during decoding and uncovering attention redundancy: image processing that extends unnecessarily into deep layers and a surplus of non-essential image tokens. It introduces MINT, a training-free decoding strategy that (1) selects a small set of key image tokens from shallow layers and masks the rest, and (2) applies contrastive decoding with an adaptive plausibility constraint to recalibrate token probabilities. Across multiple LVLMs and benchmarks, MINT reduces hallucinations by about 4% relative to the original models and lets the model perceive about 5% more visual points despite using fewer image tokens, all without additional training. This makes it a practical, resource-efficient route to more reliable multimodal generation.
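As a rough illustration of step (1), the sketch below ranks image tokens by the attention they receive in shallow layers and masks out the rest. It is not the authors' implementation: the function names, the nested-list attention layout, the choice of layers, and keep_ratio are all hypothetical, chosen only to make the idea concrete.

```python
def select_key_image_tokens(attn, image_positions, shallow_layers, keep_ratio=0.25):
    """Pick the image tokens that receive the most attention in shallow layers.

    attn[layer][head][position] is the attention weight from the current
    query token to `position` (hypothetical layout, for illustration only).
    """
    scores = {}
    for pos in image_positions:
        total, count = 0.0, 0
        for layer in shallow_layers:
            for head_weights in attn[layer]:
                total += head_weights[pos]
                count += 1
        scores[pos] = total / count  # mean over the chosen layers and heads
    k = max(1, round(keep_ratio * len(image_positions)))
    ranked = sorted(image_positions, key=lambda p: scores[p], reverse=True)
    return set(ranked[:k])


def build_attention_mask(seq_len, image_positions, kept):
    """True means the position may be attended to; unselected image tokens are masked."""
    image_set = set(image_positions)
    return [(pos not in image_set) or (pos in kept) for pos in range(seq_len)]


# Toy input: position 0 is text, positions 1-4 are image tokens, position 5 is text.
attn = [
    [[0.1, 0.4, 0.1, 0.1, 0.1, 0.2], [0.1, 0.3, 0.2, 0.1, 0.1, 0.2]],    # layer 0
    [[0.2, 0.5, 0.05, 0.05, 0.1, 0.1], [0.1, 0.3, 0.1, 0.1, 0.2, 0.2]],  # layer 1
]
image_positions = [1, 2, 3, 4]
kept = select_key_image_tokens(attn, image_positions, shallow_layers=[0, 1], keep_ratio=0.5)
mask = build_attention_mask(6, image_positions, kept)
```

In this toy input, image tokens 1 and 4 receive the highest mean attention, so they stay visible while tokens 2 and 3 are masked. Text positions are never masked.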

Abstract

Hallucination is a long-standing and inevitable problem that hinders the application of Large Vision-Language Models (LVLMs) in domains that require high reliability. Existing methods mostly seek improvement through data annotation or training strategies and place less emphasis on the inherent problems of the underlying LLM. To fill this gap, we examine the attention mechanism of the LVLM during decoding. Our investigation uncovers prevalent attention redundancy within the hierarchical architecture of the LVLM, which manifests as overextended image processing in deep layers and an overabundance of non-essential image tokens. Based on this observation, we propose MINT (MItigating hallucinations via tokeN reducTion), a novel training-free decoding strategy. Specifically, we dynamically strengthen the LVLM's local perception capability by masking its attention to irrelevant image tokens. In addition, we use contrastive decoding to push the model to focus more on the key image regions. Together, these components guide the model to concentrate on key visual elements during generation. Extensive experiments on several popular public benchmarks show that our approach achieves a 4% improvement over the original models in mitigating hallucinations caused by distracted perception. Moreover, our approach makes the model perceive 5% more visual points even though the number of image tokens is reduced.
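The contrastive decoding step can be illustrated with a minimal sketch. It follows the common contrastive decoding recipe and is an assumption about the general form, not the paper's exact equation: logits from the pass with masked image tokens are amplified against logits from the unmodified pass, and an adaptive plausibility constraint keeps only tokens whose probability is at least beta times the top probability. The names alpha, beta, and contrastive_logits, as well as the choice of branch used for the constraint, are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(masked_logits, original_logits, alpha=1.0, beta=0.1):
    """Amplify the masked-branch logits against the original-branch logits.

    Assumed form: (1 + alpha) * masked - alpha * original, restricted to tokens
    that pass the plausibility cutoff. Pruned tokens get -inf.
    """
    probs = softmax(masked_logits)
    cutoff = beta * max(probs)  # adaptive: scales with the model's confidence
    calibrated = []
    for l_masked, l_orig, p in zip(masked_logits, original_logits, probs):
        if p >= cutoff:
            calibrated.append((1 + alpha) * l_masked - alpha * l_orig)
        else:
            calibrated.append(float("-inf"))
    return calibrated


# Toy vocabulary of three tokens (made-up numbers).
masked = [2.0, 1.5, -3.0]    # logits when only key image tokens are visible
original = [2.0, 2.5, -1.0]  # logits from the unmodified forward pass
calibrated = contrastive_logits(masked, original)
next_token = max(range(len(calibrated)), key=calibrated.__getitem__)
```

With these made-up numbers the unmodified pass would greedily pick token 1, while the calibrated logits favor token 0. Token 2 is pruned by the plausibility cutoff, so the subtraction cannot promote an implausible token.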

Paper Structure

This paper contains 20 sections, 13 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Overview of our method, MINT. We first add a token selection module inside the LLM decoder so that the model dynamically concentrates on key image regions. We then use contrastive decoding to amplify the impact of the selected image tokens, providing a reference for logit refinement. The final calibrated logits lead to more reliable token prediction and improved model performance.
  • Figure 2: The overall attention allocation over different types of tokens. We randomly select 500 images from MSCOCO (Lin et al., 2014) and ask the LLaVA-1.5-7B model to respond according to the images. We set the maximum number of new tokens to 20 for simplicity and collect the average attention values using Eqn. \ref{equa:average_attention}.
  • Figure 3: The attention allocation in different layers. The patterns in the first two layers differ completely from those in the 3rd and subsequent layers.
  • Figure 4: The attention heatmaps obtained from the 2nd and 3rd layers of the LVLM when generating the token "[man]". We show the heatmaps of the six heads with the largest average attention values. The model's recognition of the image stabilizes rapidly rather than forming gradually from shallow to deep layers.
  • Figure 5: A comparative evaluation result on $\mathrm{MME}^P$.
  • ...and 5 more figures
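
The measurement described in the Figure 2 caption can be approximated with a short sketch. The paper's averaging equation is not reproduced on this page, so the version below is only an assumption about its general shape: for each generated token, sum the attention mass that falls on each token type, then average over heads and decoding steps. The token-type labels and the nested-list layout are hypothetical.

```python
from collections import defaultdict


def attention_by_token_type(step_attn, token_types):
    """Average attention mass per token type.

    step_attn[step][head][position] is the attention from the token generated
    at `step` to each earlier position; token_types[position] is its label.
    """
    totals = defaultdict(float)
    rows = 0
    for heads in step_attn:
        for weights in heads:
            for pos, w in enumerate(weights):
                totals[token_types[pos]] += w
            rows += 1  # one attention row per (step, head) pair
    return {label: mass / rows for label, mass in totals.items()}


# Toy sequence: one system token, two image tokens, one instruction token,
# and one previously generated output token that is visible from step 1.
token_types = ["system", "image", "image", "instruction", "output"]
step_attn = [
    [[0.5, 0.125, 0.125, 0.25], [0.25, 0.25, 0.25, 0.25]],                # step 0
    [[0.5, 0.0, 0.25, 0.125, 0.125], [0.25, 0.125, 0.125, 0.25, 0.25]],   # step 1
]
shares = attention_by_token_type(step_attn, token_types)
```

Because every attention row sums to 1, the resulting shares also sum to 1, which makes the allocation across token types directly comparable between layers or models.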