CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models

Junho Kim, Hyunjun Kim, Yeonju Kim, Yong Man Ro

TL;DR

Cross-modal hallucination in large multi-modal models undermines reliability by producing descriptions not grounded in visual evidence. The paper presents CODE, a training-free contrastive decoding method that uses self-generated comprehensive image descriptions as a visual counterpoint and dynamically regulates information flow with a token- and distribution-aware α_t, guided by a bounded-divergence metric. By integrating adaptive information constraints, CODE selectively retains informative tokens and suppresses hallucinated ones, achieving improved cross-modal consistency across multiple state-of-the-art LMMs and benchmarks without additional training. The approach demonstrates robust reductions in hallucinations on both discriminative and generative tasks, offering a practical decoding-level solution that can be readily deployed to enhance real-world vision-language AI systems.

Abstract

Large Multi-modal Models (LMMs) have recently demonstrated remarkable abilities in visual context understanding and coherent response generation. However, alongside these advancements, hallucination has emerged as a significant challenge: models produce erroneous responses that are unrelated to the visual content. In this paper, we introduce a novel contrastive-based decoding method, COuntering DEscription Contrastive Decoding (CODE), which leverages self-generated descriptions as contrasting references during the decoding phase of LMMs to address hallucination issues. CODE uses the comprehensive description generated by the model itself as a visual counterpart to correct the response and improve its alignment with the actual visual content. By dynamically adjusting the information flow and distribution of next-token predictions in the LMM's vocabulary, CODE enhances the coherence and informativeness of generated responses. Extensive experiments demonstrate that our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs. Our method provides a simple yet effective decoding strategy that can be integrated into existing LMM frameworks without additional training.
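
The decoding idea described above can be sketched for a single step. This is a minimal illustration under stated assumptions, not the authors' implementation: Jensen-Shannon divergence (base 2, so it lies in [0, 1]) stands in for the paper's bounded-divergence metric, the contrast takes the common form (1 + α_t) · log p_v − α_t · log p_d, and the mappings α_t = 1 − divergence and cutoff = divergence × max p_v are assumed for illustration. All function names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; bounded in [0, 1]."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def contrast_step(logits_v, logits_d):
    """Contrast image-conditioned logits (v) with description-conditioned logits (d).

    Returns (scores, alpha_t). Tokens removed by the cutoff get -inf.
    """
    logp_v = log_softmax(logits_v)
    logp_d = log_softmax(logits_d)
    p_v = [math.exp(x) for x in logp_v]
    p_d = [math.exp(x) for x in logp_d]

    divergence = js_divergence(p_v, p_d)
    alpha_t = 1.0 - divergence          # ASSUMPTION: illustrative mapping
    cutoff = divergence * max(p_v)      # ASSUMPTION: illustrative plausibility cutoff

    scores = [
        (1.0 + alpha_t) * lv - alpha_t * ld if pv >= cutoff else float("-inf")
        for lv, ld, pv in zip(logp_v, logp_d, p_v)
    ]
    return scores, alpha_t
```

With toy logits such as logits_v = [1.0, 1.2, -1.0] and logits_d = [0.2, 1.5, -1.0], the second token leads under the image alone but is favored even more strongly by the description, so the contrast shifts the top score to the first token.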

Paper Structure (33 sections, 4 equations, 8 figures, 5 tables, 1 algorithm)

Figures (8)

  • Figure 1: The overall decoding procedure of CODE. After the LMM generates a detailed description of the visual content by itself (see the instruction in Appendix A), the model recursively outputs logits conditioned on each of $v$ and $d$. By contrasting the two log-likelihoods, CODE produces more contextual and correct responses that match the given visual content while suppressing inconsistent words (catching $\rightarrow$ hit).
  • Figure 2: Comparison on two benchmarks (MMVP [tong2024eyes]: multiple choice; LLaVA-Bench [liu2023visual]: description-level). The plain bars show results for models that use self-generated descriptions in place of the visual input, and the dotted bars show results for the original models given the actual visual content.
  • Figure 3: Overview of experimental results on $6$ baseline LMMs, $6$ decoding methods, and $6$ hallucination benchmarks in spider chart format.
  • Figure 4: A token-level case study for CODE. The rows show the logit scores from the visual content ($\text{logit}_{v}$), from the comprehensive description ($\text{logit}_{d}$), and after applying CODE ($\text{logit}_{\text{code}}$), respectively.
  • Figure 5: Additional experiments on In-the-Wild benchmarks. Note that, unlike on the other datasets, OPERA [huang2023opera] fails to generate consistent responses on real-world datasets when using Yi-VL [young2024yi].
  • ...and 3 more figures
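
The two-stage procedure in the Figure 1 caption (the model first describes the image, then decodes while contrasting the image-conditioned and description-conditioned predictions) can be sketched as a greedy loop. This is a simplified illustration, not the paper's algorithm: the callables `describe` and `next_logits` are hypothetical stand-ins for a real LMM, and a fixed contrast weight `alpha` replaces the paper's adaptive α_t to keep the loop short.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_greedy_decode(describe, next_logits, image, question,
                              alpha=1.0, max_new_tokens=8, eos_id=0):
    """Greedy decoding that contrasts two contexts at every step.

    describe(image) -> description        (hypothetical: the model's own description)
    next_logits(context, question, tokens) -> list of floats over the vocabulary
    """
    # Stage 1: the model writes a comprehensive description of the image.
    description = describe(image)

    # Stage 2: decode token by token, scoring each candidate under both contexts.
    tokens = []
    for _ in range(max_new_tokens):
        logp_v = log_softmax(next_logits(image, question, tokens))
        logp_d = log_softmax(next_logits(description, question, tokens))
        # ASSUMPTION: fixed-weight contrast; the paper adapts the weight per step.
        scores = [(1.0 + alpha) * lv - alpha * ld
                  for lv, ld in zip(logp_v, logp_d)]
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens
```

Both forward passes share the same question and the same partial response; only the visual context differs, so a token that the description supports more strongly than the image itself is penalized.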