Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah

Abstract

Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet these architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens by textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
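To make the entropy signal concrete, the minimal sketch below contrasts a sharply localized cross-attention distribution with a near-uniform one. The tensor shapes, names, and the choice of natural-log entropy are illustrative assumptions, not the paper's implementation.

```python
import torch

def spatial_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of an attention distribution over image tokens.

    attn: non-negative weights of shape (..., num_image_tokens); each slice is
    renormalized to sum to 1 before the entropy is computed.
    """
    p = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

num_image_tokens = 256
peaked = torch.zeros(num_image_tokens)
peaked[42] = 1.0                                    # all mass on one image patch
uniform = torch.full((num_image_tokens,), 1.0 / num_image_tokens)

print(spatial_entropy(peaked))    # ~0: localized, visually grounded support
print(spatial_entropy(uniform))   # ~log(256) ≈ 5.55: diffuse, ungrounded support
```

Under this view, a low spatial entropy indicates that the token's visual evidence is concentrated on a small image region, whereas a high (near-maximal) entropy indicates the decoder is not attending to any specific region.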

Paper Structure

This paper contains 35 sections, 24 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: VISAGE resolves language shortcuts by correcting the objective mismatch during parallel unmasking. Given the query "Is there a cup in the image?", the standard decoder (left) assigns a high language-only confidence ($0.9$) to the statistically plausible token "no." Because this proxy score lacks visual verification, the token is finalized prematurely and induces a hallucination. In contrast, VISAGE (right) calculates a visually corrected confidence ($0.6$) by estimating the proxy discrepancy $b_i$. This modification to the ranking score prevents the early commitment of the ungrounded token, allowing the model to recover the correct multimodal outcome: "There is a cup on the deck."
  • Figure 2: The disparity between visual and linguistic probability mass across decoding steps suggests objective misspecification. We visualize the normalized peak probability mass for visual (orange) and language (blue) components across the refinement process. (a) In Visually Grounded Generation, peak visual probability mass exceeds the language attention prior before the commitment step (vertical dotted line), consistent with a localized, low-entropy distribution over image tokens. (b) In a Language Shortcut Hallucination, the visual component retains a uniform spatial distribution, resulting in a suppressed peak that falls below the language attention prior. This structural disparity demonstrates that the decoder optimizes for textual likelihood while bypassing localized visual probability mass, suggesting that the spatial Shannon entropy of cross-attention provides a detectable signal for re-ranking ungrounded commitments.
  • Figure 3: Overview of VISAGE during parallel masked decoding. At each decoding step, the frozen MDLLM generates candidate tokens alongside their initial confidence $c$. To verify visual support, we extract each candidate token's last-layer cross-attention weights over the image and compute Shannon entropy for each attention head. We then aggregate these values across heads via a $\beta$-quantile operator, yielding Robust Grounding Entropy $H$, which quantifies the concentration of visual support. A penalty multiplier $g = 1/(1+H)$ is then computed. Tokens are subsequently re-ranked using the linear ranking score $u = c \cdot g^{\alpha}$ ($\alpha=0.5$). As illustrated, the ungrounded token ("a") is penalized by high entropy, ensuring the visually supported token ("no") attains a high linear ranking score and is successfully committed to the sequence (see the code sketch after this figure list).
  • Figure 4: HallusionBench Category Analysis. Radar chart comparing VISAGE against the Baseline and VCD. Our method achieves robust improvements across illusion and spatial-reasoning (map, figure) categories.
  • Figure 5: Qualitative comparison of visually grounded yes/no questions. We compare MMaDA and our method on two examples where hallucination arises from language shortcuts. Top: MMaDA incorrectly answers "No" by over-relying on the textual prior that a spoon rests outside a bowl. Our method correctly grounds its "Yes" decision in localized visual evidence. Bottom: While both methods correctly answer "No", MMaDA hallucinates irrelevant objects in its intermediate thinking trace. In contrast, our method maintains strictly visually consistent descriptions.
  • ...and 4 more figures
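The re-ranking rule described in the Figure 3 caption can be reconstructed as a short sketch. This is an illustrative reading of the caption rather than the authors' released code: the $\beta$ value, tensor layout, and function name below are assumptions.

```python
import torch

def visage_rank_score(
    conf: torch.Tensor,          # (num_candidates,) language-only confidence c
    cross_attn: torch.Tensor,    # (num_candidates, num_heads, num_image_tokens)
    beta: float = 0.5,           # quantile over heads (assumed value)
    alpha: float = 0.5,          # exponent reported in the caption
) -> torch.Tensor:
    """Re-rank candidate tokens by the visually corrected score u = c * g**alpha."""
    # Per-head spatial Shannon entropy of the last-layer cross-attention.
    p = cross_attn / cross_attn.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    head_entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=-1)   # (cands, heads)

    # Robust Grounding Entropy H: beta-quantile across attention heads,
    # enforcing a localization consensus rather than trusting a single head.
    H = torch.quantile(head_entropy, beta, dim=-1)                # (cands,)

    # Penalty multiplier g shrinks toward 0 as the attention becomes diffuse.
    g = 1.0 / (1.0 + H)

    # Visually corrected ranking score used to decide which tokens to commit.
    return conf * g.pow(alpha)
```

Because $g \in (0, 1]$, a token whose cross-attention is spatially diffuse (high $H$) has its confidence shrunk before commitment, while a sharply localized token keeps a score close to its original confidence.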