Conscious Gaze: Adaptive Attention Mechanisms for Hallucination Mitigation in Vision-Language Models

Weijue Bu, Guan Yuan, Guixian Zhang

TL;DR

This work tackles object hallucination in vision-language models caused by text inertia. It introduces Conscious Gaze, a training-free framework that uses a Cognitive Demand Sensor (CDS) to detect when visual grounding is needed and a Focused Consensus Induction (FCI) module to reorient mid-layer attention toward visual tokens. The intervention is selective and token-level: it boosts visual-token attention only at high cognitive-demand moments, yielding state-of-the-art POPE and CHAIR performance across multiple backbones without retraining. Key findings show that CDS reliably predicts moments requiring grounding and that middle-layer FCI offers the strongest gains while preserving diversity and fluency. The approach demonstrates that interpretable, game-theoretic signals can be turned into practical decoding controls, offering a scalable, efficient path to more trustworthy multimodal systems.

Abstract

Large Vision-Language Models (VLMs) often exhibit text inertia, where attention drifts from visual evidence toward linguistic priors, resulting in object hallucinations. Existing decoding strategies intervene only at the output logits and thus cannot correct internal reasoning drift, while recent internal-control methods based on heuristic head suppression or global steering vectors lack principled grounding. We introduce Conscious Gaze (CG-VLM), a training-free, inference-time framework that converts game-theoretic interpretability into actionable decoding control. A Cognitive Demand Sensor built on Harsanyi interactions estimates instantaneous vision-text synergy and identifies moments when visual grounding is necessary. Conditioned on this signal, a Focused Consensus Induction module selectively reorients mid-layer attention toward visual tokens before collapse into text priors. CG-VLM achieves state-of-the-art results on POPE and CHAIR across InstructBLIP, LLaVA, Qwen-VL, and mPLUG, while preserving general capabilities, demonstrating that token-level sensing enables precise, context-aware intervention without compromising foundational knowledge.
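
The gated reorientation described above can be illustrated with a small sketch. It is not the authors' FCI module, whose exact form is not given in this summary. It only shows one simple way to shift attention mass toward visual tokens when a gate is on: scale the visual-token weights and renormalize each head. The function name `reorient_attention`, the boost strength `alpha`, and the array layout are assumptions made for illustration.

```python
import numpy as np

def reorient_attention(attn, visual_mask, gate, alpha=0.5):
    """Shift attention mass toward visual tokens when the gate is on.

    attn        : (heads, seq) attention weights for the current query token;
                  each row is assumed to sum to 1.
    visual_mask : (seq,) booleans, True where the key is a visual token.
    gate        : 0 or 1, e.g. the output of a demand sensor.
    alpha       : hypothetical boost strength (assumption, not from the paper).
    """
    attn = np.asarray(attn, dtype=float)
    if not gate:
        # No intervention: the model's own attention is left untouched.
        return attn
    mask = np.asarray(visual_mask, dtype=bool)
    # Scale visual-token weights by (1 + alpha), keep text weights as they are.
    boosted = attn * np.where(mask, 1.0 + alpha, 1.0)
    # Renormalize so every head is again a probability distribution.
    return boosted / boosted.sum(axis=-1, keepdims=True)

# One head, two visual tokens followed by one text token.
attn = [[0.2, 0.2, 0.6]]
visual_mask = [True, True, False]
out = reorient_attention(attn, visual_mask, gate=1, alpha=1.0)
# Visual mass rises from 0.4 to 0.8 / 1.4, roughly 0.571.
print(out[0, :2].sum())
```

With `gate=0` the weights pass through unchanged, which mirrors the selective, token-level character of the intervention: attention is only altered at steps flagged as needing grounding.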

Paper Structure

This paper contains 14 sections, 4 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Breaking the Text Inertia Trap. Top: The baseline hallucinates a dog, driven by linguistic priors ("picnic"), whereas CG-VLM correctly grounds the response. Bottom: Attention heatmaps reveal the mechanism. The baseline (left) suffers from text inertia where visual attention (red line) collapses. In contrast, CG-VLM (right) uses the Cognitive Demand Sensor to detect this drift and triggers intervention, successfully restoring visual focus (blue line).
  • Figure 2: Comparison of outputs from CG-VLM and two baselines. Left (Nucleus Sampling): The baseline model hallucinates and claims that there is only one person. Middle (Static FCI): The model is factually accurate but produces stilted language. Right (CG-VLM): Our method correctly identifies multiple people, the bus, and the skis while delivering a fluent description. GPT-4o rates CG-VLM highest for both fluency (9/10) and accuracy (9/10).
  • Figure 3: Cognitive Demand Sensor (CDS) overview. Given the image and prefix $y_{<t}$, CDS computes interaction variance $D_{y_t}$ over top-$k$ candidates and compares it with threshold $\kappa$. When $D_{y_t}>\kappa$, the gate $\beta_t$ activates FCI (pink region).
  • Figure 4: Attention consensus evidence. Top: per-head visual attention ratios across decoding steps before (No FCI) and after CG-VLM intervention. Bottom: CG-VLM lowers the head divergence index (left) and shifts aggregated attention toward visual tokens (right), directly illustrating CDS-triggered FCI at work.
  • Figure 5: Effect of CDS and FCI on POPE-COCO (InstructBLIP). (a) POPE F1 gains from progressively stronger gating, with Distinct-2 and trigger rates showing CDS keeps diversity high while firing on 48% of tokens. (b) GPT-4o blind scores on 2000 COCO prompts confirm that CG-VLM improves accuracy/detail without hurting fluency.
  • ...and 1 more figures
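
The gating rule in the Figure 3 caption (activate FCI when $D_{y_t} > \kappa$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. It uses the standard two-player Harsanyi interaction, $I = v(\{V,T\}) - v(\{V\}) - v(\{T\}) + v(\emptyset)$, and reads "interaction variance over top-$k$ candidates" as the variance of that quantity across the candidates. The value function $v$ (for example, a candidate's score with one modality masked), the function names, and the score layout are all hypothetical.

```python
import numpy as np

def harsanyi_pair_interaction(v_both, v_vision, v_text, v_empty):
    """Two-player Harsanyi interaction between vision (V) and text (T):
    I = v({V,T}) - v({V}) - v({T}) + v(empty set)."""
    return v_both - v_vision - v_text + v_empty

def cds_gate(scores, kappa):
    """Return (D, beta) for one decoding step.

    scores : (k, 4) array, one row per top-k candidate, with columns
             [v(both), v(vision only), v(text only), v(neither)].
             How these values are obtained is an assumption of this sketch.
    kappa  : threshold from the Figure 3 caption.
    """
    s = np.asarray(scores, dtype=float)
    interactions = harsanyi_pair_interaction(s[:, 0], s[:, 1], s[:, 2], s[:, 3])
    d = float(np.var(interactions))   # spread of vision-text synergy across candidates
    beta = 1 if d > kappa else 0      # gate: 1 activates the intervention
    return d, beta

# Three candidates whose interactions are 2.0, 0.0 and -2.0.
scores = [
    [3.0, 1.0, 0.5, 0.5],
    [1.0, 0.5, 0.5, 0.0],
    [0.0, 1.5, 1.0, 0.5],
]
d, beta = cds_gate(scores, kappa=1.0)
print(d, beta)  # variance 8/3 (about 2.667), gate = 1
```

When all candidates show the same interaction, the variance is zero and the gate stays closed, so the decoding step proceeds without intervention.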