DOPRA: Decoding Over-accumulation Penalization and Re-allocation in Specific Weighting Layer

Jinfeng Wei, Xiaofeng Zhang

TL;DR

This work targets hallucinations in multimodal large language models by analyzing self-attention dynamics and identifying summary tokens as a key source of erroneous, image-inaccurate outputs. It introduces DOPRA, a decoding-time framework featuring an over-accumulation penalty applied to a middle Transformer layer (notably layer 12) and a retrospective reallocation mechanism to recover from persistent aggregation patterns, all without requiring additional training data or external knowledge sources. The approach is complemented by visualizations that link generated text to high-response image regions, enabling interpretable cross-modal grounding. Empirical results on the CHAIR and POPE benchmarks show that DOPRA reduces hallucinations across long and short captions and various model configurations, a practical improvement to MLLM reliability that needs no extra training. These findings highlight decoding-time interventions as a viable route to mitigating hallucinations in real-world applications.

Abstract

In this work, we introduce DOPRA, a novel approach designed to mitigate hallucinations in multi-modal large language models (MLLMs). Unlike existing solutions that typically require costly supplementary training data or the integration of external knowledge sources, DOPRA addresses hallucinations at decoding time by applying penalties and redistribution in specific weighted layers, offering an economical and effective solution without additional resources. DOPRA is grounded in insights into the intrinsic mechanisms controlling hallucinations within MLLMs, especially the models' tendency to over-rely on a subset of summary tokens in the self-attention matrix while neglecting critical image-related information. This phenomenon is particularly pronounced in certain layers. To counteract this over-reliance, DOPRA applies weighted over-accumulation penalties and redistribution in specific layers, such as the 12th layer, during the decoding process. Furthermore, DOPRA includes a retrospective allocation process that re-examines the sequence of generated tokens, allowing the algorithm to reallocate token selection to better align with the actual image content, thereby reducing the incidence of hallucinatory descriptions in auto-generated captions. Overall, DOPRA improves the output quality of MLLMs by systematically reducing hallucinations through targeted adjustments during the decoding process.
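
The abstract names two decoding-time components: a penalty on attention that piles up on a few summary tokens in one chosen layer, and a retrospective step that revisits earlier tokens when the pile-up persists. The sketch below is only a rough illustration of that idea, not the paper's equations. The function names, the column-sum scoring rule, the `alpha` weight, and the `patience` threshold are all hypothetical choices made for this example.

```python
# Illustrative sketch only: the scoring rule and all names are assumptions,
# not the formulas used by DOPRA.

def over_accumulation_score(attn, start):
    """Return (score, column) for the most over-attended answer token.

    attn:  square lower-triangular list of lists taken from one layer
           (for example layer 12); attn[i][j] is the weight token i
           places on earlier token j.
    start: index of the first answer token; earlier columns are ignored.
    """
    n = len(attn)
    best_score, best_col = 0.0, start
    for j in range(start, n):
        # Attention that all later tokens pay to column j (diagonal excluded).
        col = sum(attn[i][j] for i in range(j + 1, n))
        if col > best_score:
            best_score, best_col = col, j
    return best_score, best_col


def penalize(log_probs, scores, alpha=1.0):
    """Subtract alpha * score from each candidate's log-probability."""
    return [lp - alpha * s for lp, s in zip(log_probs, scores)]


def should_roll_back(peak_columns, patience=3):
    """True when the last `patience` steps all peaked on the same column."""
    recent = peak_columns[-patience:]
    return len(recent) == patience and len(set(recent)) == 1


if __name__ == "__main__":
    attn = [
        [1.0, 0.0, 0.0, 0.0],
        [0.5, 0.5, 0.0, 0.0],
        [0.1, 0.8, 0.1, 0.0],
        [0.1, 0.7, 0.1, 0.1],
    ]
    score, col = over_accumulation_score(attn, start=1)
    print(col, penalize([-1.0, -2.0], [score, 0.0]))
    print(should_roll_back([1, 1, 1]))
```

In this toy setup, a candidate whose attention map concentrates on one earlier token loses log-probability relative to a candidate without that pattern, and a run of steps peaking on the same column is the signal that would trigger re-selecting an earlier token.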

Paper Structure (16 sections, 11 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Comparison of LLaVA-1.5 results with DOPRA and OPERA.
  • Figure 2: Attention weighting graph comparison. Layer 8 weight maps are shown on the left and layer 12 weight maps are shown on the right.
  • Figure 3: The structure of our method. Decoding uses our proposed DOPRA. The "Text-Correlated Attention Heatmap Generator" generates heatmaps for $E_t$; its pseudo-code is given in the supplementary material.
  • Figure 4: The flow chart of DOPRA's decoder. The tokens of LLaVA-1.5 are divided into system tokens, image tokens, user tokens, and answer tokens. DOPRA applies the attention accumulation penalty to the answer tokens.
  • Figure 5: Comparison of attention results for reason tokens.
  • ...and 1 more figures