Revealing and Enhancing Core Visual Regions: Harnessing Internal Attention Dynamics for Hallucination Mitigation in LVLMs

Guangtao Lyu, Qi Liu, Chenghao Xu, Jiexi Yan, Muli Yang, Xueting Li, Fen Fang, Cheng Deng

TL;DR

The paper proposes Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention that constructs a Positive Attention Dynamics (PAD) map to identify semantically core visual regions, applies per-head Median Absolute Deviation Scaling to adaptively control the intervention strength, and uses System-Token Compensation to maintain attention to complex user instructions and support long-term output consistency.

Abstract

Large vision-language models (LVLMs) have achieved strong multimodal reasoning capabilities but remain prone to hallucinations, producing outputs inconsistent with visual inputs or user instructions. Existing training-free methods each have drawbacks: contrastive decoding and auxiliary expert models incur several times more computational overhead and may introduce interference, while static internal signal enhancement is often vulnerable to the attention sink phenomenon. We find that internal Positive Attention Dynamics (PAD) in LVLMs naturally reveal semantically core visual regions even under the distortions of attention sinks. Based on this, we propose Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention that constructs a PAD map to identify semantically core visual regions, applies per-head Median Absolute Deviation Scaling to adaptively control the intervention strength, and leverages System-Token Compensation to maintain attention to complex user instructions and support long-term output consistency. Experiments on multiple LVLMs and benchmarks show that PADE improves visual grounding and reduces hallucinations, validating the effectiveness of leveraging internal attention dynamics for reliable multimodal reasoning.
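
The abstract names the ingredients of PADE but not their formulas, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that a PAD map accumulates the positive layer-to-layer increases in attention for each visual token, and that the per-head Median Absolute Deviation (MAD) of the attention weights sets the size of the boost. The function names (`positive_attention_dynamics`, `mad`, `enhance_head`) and the `strength` parameter are hypothetical, and System-Token Compensation and renormalization are left out because the summary does not specify them.

```python
from statistics import median


def positive_attention_dynamics(attn_by_layer):
    """Sum only the positive layer-to-layer changes in attention per visual token.

    attn_by_layer: list of layers, each a list of attention weights over the
    same visual tokens. ASSUMPTION: this definition of "positive dynamics" is
    an illustrative guess, not taken from the paper.
    """
    num_tokens = len(attn_by_layer[0])
    pad = [0.0] * num_tokens
    for prev, curr in zip(attn_by_layer, attn_by_layer[1:]):
        for i in range(num_tokens):
            # Keep increases, discard decreases.
            pad[i] += max(curr[i] - prev[i], 0.0)
    return pad


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    med = median(values)
    return median([abs(v - med) for v in values])


def enhance_head(head_attn, pad_map, strength=1.0):
    """Boost one head's visual attention in proportion to the PAD map.

    ASSUMPTION: the boost is scaled by this head's own MAD so that heads with
    a narrow attention spread receive a proportionally small intervention.
    """
    scale = mad(head_attn)
    peak = max(pad_map) or 1.0  # avoid dividing by zero for an all-zero map
    return [a + strength * scale * (p / peak) for a, p in zip(head_attn, pad_map)]


# Token 0 behaves like an attention sink: high static attention, little growth.
# Token 1 starts low but gains attention across layers.
layers = [[0.5, 0.1, 0.1], [0.4, 0.3, 0.1], [0.45, 0.5, 0.1]]
pad_map = positive_attention_dynamics(layers)
boosted = enhance_head([0.6, 0.2, 0.1], pad_map)
```

In this toy input, token 0 has the largest static attention in the first layer, yet token 1 receives the largest PAD value because its attention grows across layers. This mirrors the contrast between static and dynamic signals that the paper draws.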

Paper Structure

This paper contains 19 sections, 8 equations, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Comparison of hallucination mitigation paradigms. (a) Contrastive decoding methods. (b) Auxiliary expert methods. (c) Static internal signal methods. (d) Our PADE: internal positive attention dynamics.
  • Figure 2: Static versus dynamic internal attention signals. Static mean attention is dominated by attention sinks, while Positive Attention Dynamics (PAD) more reliably highlight semantically core visual regions.
  • Figure 3: Attention analysis of LLaVA-1.5-7B (left) and 13B (right). Top: the attention ratio of different token types (System, Vision, Instruction, Output). Bottom: heatmap visualizations of attention distributions, including (a) static attention from uniformly sampled layers, (b) layer-averaged static attention, and (c) positive attention dynamics.
  • Figure 4: Overview of PADE. PADE identifies semantically core visual regions via Positive Attention Dynamics (PAD) and selectively enhances them in the target layer, using Median Absolute Deviation Scaling to adaptively control the intervention strength and System-Token Compensation to preserve attention for understanding complex instructions and to keep long-term generation consistent.
  • Figure 5: Ablation results for the intervention layer and strength $\lambda$ on LLaVA-1.5-7B (top) and 13B (bottom).
  • ...and 5 more figures