Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs

Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha

TL;DR

The paper analyzes hallucinations in LVLMs and finds that current mitigation techniques improve visual recognition but do not consistently improve cognitive reasoning. It identifies a visual perception gap: LVLMs can recognize visual elements but fail to integrate them with the prompt in context, which hinders reasoning. To address this, it proposes Visual Description Grounded Decoding (VDGD), a training-free decoding strategy that prefixes a generated image description to the prompt and uses KL divergence to bias token selection toward tokens aligned with that description. Across eight visual reasoning benchmarks and the VaLLu evaluation suite, VDGD yields 2–33% performance gains, making LVLMs more reliable on reasoning-intensive tasks.

Abstract

Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucinations. While hallucinations are well-studied, the exact causes behind them remain underexplored. In this paper, we first investigate the root causes of hallucinations in LVLMs. Our findings reveal that existing mitigation techniques primarily reduce hallucinations for visual recognition prompts (those that require simple descriptions of visual elements) but fail for cognitive prompts that demand deliberate reasoning. We identify the core issue as a lack of true visual perception in LVLMs: although they can accurately recognize visual elements, they struggle to fully interpret these elements in the context of the input prompt and to link this recognition effectively to their internal knowledge, which is critical for reasoning. To address this gap, we introduce Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs. VDGD works by first generating a detailed description of the image and prepending it to the instruction. During response generation, tokens are sampled based on their KL divergence to the description, favoring candidates with lower divergence. Experimental results on multiple visual reasoning benchmarks and LVLMs demonstrate that VDGD consistently outperforms existing baselines by 2% to 33%. Finally, we introduce VaLLu, a benchmark designed for comprehensive evaluation of the cognitive capabilities of LVLMs.
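
The decoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made for illustration: the function names (`vdgd_step`, `kl_onehot_to_dist`), the candidate cutoff `alpha`, the use of the model's next-token distributions at each description position as the reference for "the description", and a greedy pick in place of sampling. Under these assumptions, the KL divergence between a one-hot candidate token and a reference distribution reduces to the negative log-probability of that token, and each candidate is scored by its smallest divergence over all description positions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_onehot_to_dist(token_id, dist, eps=1e-12):
    """KL(one_hot(token_id) || dist), which reduces to -log dist[token_id]."""
    return -math.log(max(dist[token_id], eps))


def vdgd_step(step_logits, description_dists, alpha=0.1):
    """One illustrative decoding step (hypothetical helper, not from the paper).

    step_logits: logits over the vocabulary at the current response position.
    description_dists: one probability distribution per description token
        (assumed here to be the model's distributions over the description prefix).
    alpha: assumed cutoff; keep tokens with prob >= alpha * max prob.
    """
    probs = softmax(step_logits)
    cutoff = alpha * max(probs)
    candidates = [i for i, p in enumerate(probs) if p >= cutoff]

    # Score each candidate by its minimum divergence to any description position.
    scores = {
        tok: min(kl_onehot_to_dist(tok, d) for d in description_dists)
        for tok in candidates
    }

    # Greedy simplification: pick the lowest divergence, break ties by probability.
    chosen = min(candidates, key=lambda t: (scores[t], -probs[t]))
    return chosen, scores


# Toy vocabulary of 4 tokens; token 3 is too unlikely to be a candidate.
step_logits = [2.0, 1.9, 0.0, -5.0]
description_dists = [
    [0.05, 0.90, 0.04, 0.01],
    [0.10, 0.60, 0.20, 0.10],
]
chosen, scores = vdgd_step(step_logits, description_dists)
print(chosen, scores)
```

In this toy example, token 0 has the highest raw probability, but token 1 is far better supported by the description distributions, so the grounded step selects token 1. A sampling variant would draw from a distribution that down-weights high-divergence candidates instead of taking the minimum.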

Paper Structure

This paper contains 48 sections, 4 equations, 42 figures, 15 tables, 1 algorithm.

Figures (42)

  • Figure 1: Depending on the text instruction, an LVLM might be assessed on one or more different capabilities.
  • Figure 2: (Left) Performance comparison of different LVLMs on various benchmarks. (Right) Performance comparison of different hallucination mitigation techniques applied to LLaVA-1.5.
  • Figure 3: (Left) Performance comparison of different LVLMs on various benchmarks when prompted to only describe the image. (Right) Performance comparison of different hallucination mitigation techniques applied to LLaVA-1.5.
  • Figure 4: Types of Visual Recognition Hallucinations. We define Algo. 1 to automatically divide VR hallucinations into four categories: Language, Vision, Style, and IT (explained further in the section on categorizing visual hallucinations). While language and vision hallucinations have been explored earlier, and methods to alleviate them have been proposed, we show for the first time that Style and IT hallucinations exist and that existing methods fail to alleviate them.
  • Figure 5: (Left) Frequency comparison of hallucination categories. (Right) Comparison for LLaVA-v1.5 with hallucination mitigation techniques. The top graph compares AMBER, and the bottom graph compares MMMU.
  • ...and 37 more figures