Revealing Perception and Generation Dynamics in LVLMs: Mitigating Hallucinations via Validated Dominance Correction

Guangtao Lyu, Xinyi Cheng, Chenghao Xu, Qi Liu, Muli Yang, Fen Fang, Huilin Chen, Jiexi Yan, Xu Yang, Cheng Deng

TL;DR

LVLMs struggle with hallucinations despite strong grounding signals. The authors uncover two internal dynamics—GATE patterns in visual perception and SAD patterns in token generation—and introduce Validated Dominance Correction (VDC), a training-free method that validates and replaces unsupported tokens based on cross-layer dominance in attention and FFN. Across POPE, CHAIR, and MME benchmarks and multiple LVLM backbones, VDC consistently reduces hallucinations and improves factual grounding. The work offers a mechanistic, interpretable lens into LVLM behavior and a lightweight, broadly applicable mitigation strategy with practical impact for safer multimodal reasoning. Together, these contributions advance both understanding and reliability of LVLMs in real-world contexts.

Abstract

Large Vision-Language Models (LVLMs) have shown remarkable capabilities, yet hallucinations remain a persistent challenge. This work presents a systematic analysis of the internal evolution of visual perception and token generation in LVLMs, revealing two key patterns. First, perception follows a three-stage GATE process: early layers perform a Global scan, intermediate layers Approach and Tighten on core content, and later layers Explore supplementary regions. Second, generation exhibits an SAD (Subdominant Accumulation to Dominant) pattern, where hallucinated tokens arise from the repeated accumulation of subdominant tokens lacking support from attention (visual perception) or feed-forward network (internal knowledge). Guided by these findings, we devise the VDC (Validated Dominance Correction) strategy, which detects unsupported tokens and replaces them with validated dominant ones to improve output reliability. Extensive experiments across multiple models and benchmarks confirm that VDC substantially mitigates hallucinations.
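To make the "detect and replace" idea concrete, here is a minimal toy sketch. It is not the authors' algorithm: the abstract does not specify how support is measured, so the sketch assumes that per-layer attention and feed-forward outputs have already been projected onto the vocabulary (the arrays `attn_logits` and `ffn_logits` are hypothetical inputs), and it treats a token as validated if it is the top-1 token in at least one such projection. The function name and the fallback rule are also assumptions made for illustration.

```python
import numpy as np


def validated_dominance_correction(final_logits, attn_logits, ffn_logits):
    """Toy token-level correction under the assumptions stated above.

    final_logits: (vocab,) logits from the model's last layer.
    attn_logits:  (layers, vocab) hypothetical per-layer attention outputs
                  projected onto the vocabulary.
    ffn_logits:   (layers, vocab) the same for feed-forward outputs.
    Returns the index of the token to emit.
    """
    final_logits = np.asarray(final_logits, dtype=float)
    candidate = int(np.argmax(final_logits))

    # Tokens that are dominant (top-1) in at least one layer's
    # attention or feed-forward projection count as validated here.
    attn_dominant = np.argmax(np.asarray(attn_logits), axis=1)
    ffn_dominant = np.argmax(np.asarray(ffn_logits), axis=1)
    validated = sorted({int(t) for t in attn_dominant} | {int(t) for t in ffn_dominant})

    if candidate in validated:
        return candidate  # the final token has support; keep it

    # Unsupported candidate: fall back to the validated token that the
    # final layer scores highest (an assumed tie-breaking rule).
    validated = np.array(validated)
    return int(validated[np.argmax(final_logits[validated])])
```

In this sketch a candidate that is never dominant in any intermediate projection is the stand-in for a token produced by subdominant accumulation; a real implementation would need the paper's actual validation criterion.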

Paper Structure

This paper contains 13 sections, 4 equations, 20 figures, 8 tables, 1 algorithm.

Figures (20)

  • Figure 1: Overview of our framework. We analyze hallucinations via internal perception and generation dynamics, revealing the GATE (Global, Approach & Tighten, Explore) pattern in visual perception and the SAD (Subdominant Accumulation to Dominant) pattern in token generation, and devise the VDC (Validated Dominance Correction) strategy to mitigate hallucinations.
  • Figure 2: Attention ratio of different token types (System, Vision, Instruction, Output) across layers.
  • Figure 3: Visual attention heatmaps across layers.
  • Figure 4: Instruction attention heatmaps across layers.
  • Figure 5: Heatmaps of visual attention across different stages, illustrating the GATE pattern (Global–Approach&Tighten–Explore). In the Global stage, the model attends broadly to the entire image; in the Approach phase, attention gradually shifts toward the apple region; during the Tighten phase, focus converges tightly on the apple; and in the final Explore stage, attention expands again to nearby areas.
  • ...and 15 more figures
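
The per-layer attention ratio named in the Figure 2 caption can be sketched as follows. This is an illustration under assumptions, not the paper's stated definition: it takes the ratio to be the share of one query token's attention mass that falls on each token type, averaged over heads, and it assumes the token types occupy contiguous, known index ranges. The function name and the `spans` argument are hypothetical.

```python
import numpy as np


def attention_ratio_by_type(attn, spans):
    """Share of attention mass per token type, for each layer.

    attn:  (layers, heads, seq_len) attention weights from a single
           query token to every key position.
    spans: dict mapping a type name to a half-open (start, end) range
           of key positions, e.g. {"vision": (2, 7)}.
    Returns a dict mapping each type name to an array of shape (layers,).
    """
    attn = np.asarray(attn, dtype=float)
    per_layer = attn.mean(axis=1)      # average over heads -> (layers, seq_len)
    total = per_layer.sum(axis=1)      # normalizer per layer
    return {
        name: per_layer[:, start:end].sum(axis=1) / total
        for name, (start, end) in spans.items()
    }
```

Plotting the returned arrays against the layer index would give one curve per token type (System, Vision, Instruction, Output), assuming the spans cover the whole sequence.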