Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

Boyang Gong, Yu Zheng, Fanye Kong, Jie Zhou, Jiwen Lu

Abstract

Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia: it settles during early decoding steps, remains largely static thereafter, and fails to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for cognitive hallucinations, which require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to shift dynamically to support relational inference. We therefore propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends, distinguishing them from tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
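
To make the two mechanisms above concrete, the sketch below illustrates, under our own assumptions, how an IVE-style adjustment might be applied to the visual-token attention of a single decoding step: an exponential moving average stands in for the historical attention trend, a margin tau flags dynamically emergent tokens, and a persistence-dependent factor gamma penalizes inertial ones. The function name ive_step, the hyperparameters, and the redistribution rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ive_step(attn, trend, persist, alpha=0.9, tau=0.1, gamma=0.2):
    """One decoding step of an IVE-style adjustment (illustrative sketch).

    attn    : (V,) current attention over visual tokens for the new text token
    trend   : (V,) running EMA estimate of historical attention, or None at step 0
    persist : (V,) consecutive steps each token has behaved inertially
    alpha, tau, gamma are hypothetical hyperparameters (EMA rate, emergence
    margin, penalty strength); the paper may parameterize these differently.
    """
    if trend is None:
        return attn, attn.copy(), np.zeros_like(attn)

    # Trend-guided token selection: tokens rising clearly above their
    # historical trend are treated as dynamically emergent; tokens that
    # merely hold their trend-level attention are treated as inertial.
    deviation = attn - trend
    emergent = deviation > tau * (trend + 1e-8)
    inertial = (~emergent) & (attn >= trend)

    # Inertia-aware penalty: attenuate inertial tokens more strongly the
    # longer they persist, and hand the removed mass to emergent tokens.
    persist = np.where(inertial, persist + 1, 0)
    penalty = np.where(inertial, attn * (1 - 1 / (1 + gamma * persist)), 0.0)
    adjusted = attn - penalty
    if emergent.any():
        adjusted[emergent] += penalty.sum() / emergent.sum()
    adjusted /= adjusted.sum()  # keep a valid attention distribution

    # Update the historical trend with an exponential moving average.
    trend = alpha * trend + (1 - alpha) * attn
    return adjusted, trend, persist
```

In a full MLLM decoding loop, such an adjustment would act on the visual slice of the attention scores at each next-token prediction before the modified distribution is used for generation; the paper's actual selection and penalty rules are specified in its method section.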

Paper Structure

This paper contains 16 sections, 15 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Overview of cognitive hallucinations and their mitigation via the proposed IVE. Upper Left: Comparison between perceptual and cognitive hallucinations in MLLMs. Upper Right: Comparison with hallucination mitigation methods across multiple benchmarks. Bottom: Compared to PAI liu2024paying, which amplifies visual attention and is designed primarily for mitigating perceptual hallucinations, our IVE effectively reduces both perceptual and cognitive hallucinations by exciting inertial visual attention.
  • Figure 2: Limited Effectiveness of Visual Attention Amplification on Cognitive Hallucinations. We assess the improvements of attention amplification (PAI) liu2024paying on perceptual (POPE) Li-hallucination-2023 and cognitive (Reefknot) zheng2024reefknot hallucination benchmarks.
  • Figure 3: The naive visual attention amplification method exacerbates visual inertia. A visual activeness comparison between (a) the baseline model and (b) the naive attention amplification method PAI liu2024paying shows that the amplification strategy reduces visual attention activeness, thereby triggering cognitive hallucinations.
  • Figure 4: Overview of our proposed framework. Left: The autoregressive generation process of MLLMs integrated with IVE, which operates on token attention during next-token prediction. Top Right: Trend-guided Token Selection partitions visual tokens according to their temporal deviation from historical attention trends, distinguishing dynamically emergent tokens from inertia tokens that maintain stable concentration patterns, thereby enabling adaptive emphasis on newly relevant visual regions. Bottom Right: Inertia-aware Attention Penalty quantifies the persistence of inertia tokens across decoding steps and progressively attenuates their influence, reallocating the penalized attention toward emergent tokens to discourage prolonged over-concentration on historically dominant regions.
  • Figure 5: Results (%) on the MMBench liu2024mmbench benchmark, which assesses the multidimensional performance of MLLMs.
  • ...and 12 more figures