ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM

Yujun Wang, Aniri, Jinhe Bi, Soeren Pirk, Yunpu Ma

TL;DR

The paper addresses hallucinations in multimodal LLMs by analyzing how prior contrastive decoding methods shift cross-modal attention. It introduces Attention-Steerable Contrastive Decoding (ASCD), which explicitly steers attention during decoding by positively steering text-centric heads and negatively steering only a small set of critical visual tokens, integrated within a contrastive decoding framework. An offline selection step identifies the text-centric heads, which capture model-specific bias. ASCD adds negligible runtime overhead while substantially reducing hallucinations (CHAIR/POPE/MMHal-Bench) and preserving or improving standard VQA performance across multiple backbones and decoding schemes. The result is a practical, model-agnostic approach that enhances visual grounding and safety in multimodal generation, with broad applicability to current and future MLLMs.
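The two steering operations can be pictured on a single attention layer. The following is a minimal sketch, not the authors' implementation: it assumes steering is an additive shift on pre-softmax scores, and that "critical" visual tokens are the k visual tokens receiving the most attention averaged over heads. The function names, the `strength` parameter, and both of those choices are hypothetical.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def positive_steer(scores, vis_idx, text_heads, strength):
    """Raise visual-token scores, but only in the given text-centric heads.

    scores: (heads, keys) pre-softmax attention scores for the current query.
    Assumption: steering is a constant additive shift of `strength`.
    """
    out = scores.copy()
    out[np.ix_(text_heads, vis_idx)] += strength
    return out

def negative_steer(scores, vis_idx, k, strength):
    """Lower the scores of the k most-attended visual tokens in every head.

    Assumption: "critical" tokens are picked by mean attention over heads.
    Returns the steered scores and the chosen key indices.
    """
    attn = softmax(scores)
    vis_mass = attn[:, vis_idx].mean(axis=0)          # (n_vis,)
    critical = np.asarray(vis_idx)[np.argsort(vis_mass)[-k:]]
    out = scores.copy()
    out[:, critical] -= strength
    return out, critical

# Toy layer: 4 heads, 6 keys; keys 0-2 are visual tokens, 3-5 are text tokens.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 6))
pos_scores = positive_steer(scores, vis_idx=[0, 1, 2], text_heads=[1, 3], strength=1.0)
neg_scores, critical = negative_steer(scores, vis_idx=[0, 1, 2], k=1, strength=1.0)
```

In this toy setup, heads 1 and 3 shift attention toward the visual keys under positive steering while the other heads are untouched, and negative steering reduces attention on one visual key in all heads.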

Abstract

Multimodal large language models (MLLMs) frequently hallucinate by over-committing to spurious visual cues. Prior remedies, Visual and Instruction Contrastive Decoding (VCD, ICD), mitigate this issue, yet the mechanism remains opaque. We first empirically show that their improvements systematically coincide with redistributions of cross-modal attention. Building on this insight, we propose Attention-Steerable Contrastive Decoding (ASCD), which directly steers the attention scores during decoding. ASCD combines (i) positive steering, which amplifies automatically mined text-centric heads (stable within a model and robust across domains), with (ii) negative steering, which dampens on-the-fly identified critical visual tokens. The method incurs negligible runtime and memory overhead and requires no additional training. Across five MLLM backbones and three decoding schemes, ASCD reduces hallucination on POPE, CHAIR, and MMHal-Bench by up to 38.2 percent while improving accuracy on standard VQA benchmarks, including MMMU, MM-VET, ScienceQA, TextVQA, and GQA. These results position attention steering as a simple, model-agnostic, and principled route to safer, more faithful multimodal generation.
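For readers unfamiliar with contrastive decoding, the sketch below shows how the next-token logits from a positively steered pass and a negatively steered pass might be combined. It is an illustration under stated assumptions: the `(1 + alpha) * pos - alpha * neg` form and the plausibility cutoff `beta` follow common contrastive decoding practice and are not confirmed by the abstract; the function names and toy numbers are hypothetical.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D logit vector."""
    z = x - x.max()
    return z - np.log(np.exp(z).sum())

def contrastive_logits(logits_pos, logits_neg, alpha=1.0, beta=0.1):
    """Contrast two next-token distributions.

    logits_pos: logits from the positively steered forward pass.
    logits_neg: logits from the negatively steered forward pass.
    alpha:      contrast strength (assumed hyperparameter).
    beta:       plausibility cutoff; tokens whose probability under the
                positive branch is below beta * max probability are masked
                (assumed, borrowed from common contrastive decoding practice).
    """
    lp = log_softmax(np.asarray(logits_pos, dtype=float))
    ln = log_softmax(np.asarray(logits_neg, dtype=float))
    contrast = (1.0 + alpha) * lp - alpha * ln
    keep = lp >= lp.max() + np.log(beta)
    return np.where(keep, contrast, -np.inf)

# Toy three-token vocabulary. The negative branch favours token 1 more
# strongly than the positive branch does, so the contrast penalises it.
logits_pos = np.array([2.0, 2.2, -3.0])
logits_neg = np.array([0.5, 2.5, -3.0])
scores = contrastive_logits(logits_pos, logits_neg)
next_token = int(np.argmax(scores))
```

Here greedy decoding on the positive branch alone would pick token 1, while the contrasted scores pick token 0, and the implausible token 2 is masked out entirely.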

Paper Structure

This paper contains 32 sections, 6 equations, 8 figures, 7 tables, 2 algorithms.

Figures (8)

  • Figure 1: Impact of VCD and ICD on attention distribution. On 500 COCO images, we measure how Visual (VCD) and Instruction (ICD) Contrastive Decoding redistribute attention in LLaVA-1.5. Both techniques—and their combination—lower attention on visual tokens (vis) while raising it on textual tokens (text), with stronger perturbations yielding larger shifts. This suggests that the reduction in hallucinations achieved by VCD and ICD is largely attributable to the attendant shifts in cross-modal attention, rather than to the logit-subtraction step alone.
  • Figure 2: A motivating example of proactive attention steering in a visually ambiguous scenario. Top: Conversation context in which the “orange” appears blue-tinted. Middle: Effects of negative steering (decrease vision attention / increase text attention) and positive steering (increase vision attention / decrease text attention); ASCD contrasts the two steered logits to suppress hallucination and produce the perception-consistent answer. Bottom: Color-token logits change with the steering strength for visual and textual attention, corresponding to the steering above.
  • Figure 3: The stability of text-centric head distribution. Each heatmap visualizes how frequently a given head occurs among the most text-focused heads. LLaVA-1.5 (a) remains stable across generation length (b) and image set (c), whereas Phi2-SigLIP (d) and LLaVA-NeXT (e) shift markedly.
  • Figure 4: Illustration of positive and negative steering. Left: text-centric heads are boosted (positive_steer) to emphasize visual content; Right: a small set of critical visual tokens is suppressed (negative_steer), inducing a stronger contrastive effect. These selective adjustments work in tandem to reduce hallucinations and improve grounding.
  • Figure 5: Comparative effectiveness of selective attention steering. (a): Positive steering applied only to text-centric heads outperforms random or blanket head selection across various decoding strategies. (b): Negative steering focused on a small subset of critical visual tokens, integrated with contrastive decoding, significantly reduces CHAIR metrics (less hallucination) and boosts POPE scores compared to randomly suppressing the same number of visual tokens.
  • ...and 3 more figures
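
The head-frequency heatmaps described in the Figure 3 caption can be approximated as follows. This is a rough sketch under assumptions, not the paper's procedure: it scores each head by the share of its attention that lands on text tokens, marks the `top_k` heads per sample, and averages those marks over samples. The function names, the `top_k` parameter, and the scoring rule are hypothetical.

```python
import numpy as np

def text_attention_share(attn, text_idx):
    """Fraction of each head's attention that lands on text tokens.

    attn: (layers, heads, keys) attention probabilities for one generated token.
    Returns an array of shape (layers, heads).
    """
    return attn[..., text_idx].sum(axis=-1)

def text_centric_head_frequency(attn_samples, text_idx, top_k):
    """How often each head ranks among the top_k most text-focused heads.

    attn_samples: iterable of (layers, heads, keys) arrays, one per sample.
    Returns a (layers, heads) array of frequencies in [0, 1].
    """
    counts, shape, n = None, None, 0
    for attn in attn_samples:
        share = text_attention_share(attn, text_idx)
        shape = share.shape
        flat = share.ravel()
        top = np.argpartition(flat, -top_k)[-top_k:]   # indices of the top_k heads
        hit = np.zeros(flat.shape, dtype=float)
        hit[top] = 1.0
        counts = hit if counts is None else counts + hit
        n += 1
    if n == 0:
        raise ValueError("attn_samples is empty")
    return (counts / n).reshape(shape)

# Toy data: 2 layers x 2 heads, 4 keys; keys 2-3 are text tokens.
sample = np.tile([0.4, 0.4, 0.1, 0.1], (2, 2, 1))
sample[0, 1] = [0.05, 0.05, 0.45, 0.45]               # one clearly text-focused head
freq = text_centric_head_frequency([sample, sample], text_idx=[2, 3], top_k=1)
```

A head whose frequency stays close to 1 across different generation lengths and image sets would correspond to the stable pattern the caption reports for LLaVA-1.5.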