Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding

Boqi Chen, Xudong Liu, Jianing Qiu

TL;DR

The paper tackles object hallucination in Multimodal Large Language Models (MLLMs) by introducing object-aligned auxiliary views for visual contrastive decoding (VCD). It generates the auxiliary view by masking salient regions identified via a self-supervised Vision Transformer (DINO) attention, and integrates this view into VCD with an adaptive plausibility constraint, forming a contrastive distribution $p_\text{vcd}(y \mid v, v', x) = \mathrm{softmax}\big((1+\alpha)\,\mathrm{logit}_\theta(y \mid v, x, y_{<t}) - \alpha\,\mathrm{logit}_\theta(y \mid v', x, y_{<t})\big)$ and token-acceptance governed by $p_\theta(y_t \mid v, x, y_{<t}) \ge \beta \max_w p_\theta(w \mid v, x, y_{<t})$. The approach is prompt-agnostic and model-agnostic, requiring only a single cacheable forward pass, and yields consistent improvements on POPE and MME benchmarks across LLaVA-v1.5 and Qwen-VL. By producing semantically meaningful, object-level perturbations, the method strengthens grounding while maintaining fluency, though it relies on the saliency map aligning with visual evidence and may under- or over-mask in cluttered scenes. Overall, this work offers a practical and effective means to reduce object hallucinations in multimodal generation.
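The contrastive distribution and the plausibility constraint above can be illustrated for a single decoding step. The following is a minimal NumPy sketch, not the authors' implementation: the function name `vcd_next_token_probs`, the toy logits, and the default values of `alpha` and `beta` are assumptions chosen for illustration only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array (handles -inf entries)."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def vcd_next_token_probs(logits_orig, logits_aux, alpha=1.0, beta=0.1):
    """One decoding step of visual contrastive decoding (illustrative sketch).

    logits_orig: next-token logits conditioned on the original view v.
    logits_aux:  next-token logits conditioned on the auxiliary view v'.
    alpha, beta: contrast strength and plausibility cutoff
                 (default values here are assumptions, not from the paper).
    """
    logits_orig = np.asarray(logits_orig, dtype=float)
    logits_aux = np.asarray(logits_aux, dtype=float)

    # Adaptive plausibility constraint: keep only tokens whose probability
    # under the original view is at least beta times the maximum probability.
    p_orig = softmax(logits_orig)
    keep = p_orig >= beta * p_orig.max()

    # Contrastive logits: (1 + alpha) * logit(v) - alpha * logit(v').
    contrast = (1.0 + alpha) * logits_orig - alpha * logits_aux

    # Tokens that fail the constraint get zero probability.
    contrast = np.where(keep, contrast, -np.inf)
    return softmax(contrast)

# Toy vocabulary of three tokens.
probs = vcd_next_token_probs([2.0, 1.0, -5.0], [2.0, 0.0, -5.0])
```

In this toy case the first token scores equally under both views, so the contrast leaves its logit unchanged; the second token loses support when the salient evidence is masked, so its logit is boosted; the third token fails the plausibility constraint and receives zero probability.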

Abstract

We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computation overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
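Constructing the auxiliary view by removing the most salient evidence can be sketched as follows. This assumes a saliency map has already been obtained (for example, from DINO attention upsampled to image resolution) and that masked pixels are filled with the image's mean color; the function name `mask_salient_regions`, the min-max normalization, and the `threshold` default are hypothetical choices for illustration, not details confirmed by the paper.

```python
import numpy as np

def mask_salient_regions(image, saliency, threshold=0.5):
    """Build an auxiliary view by masking salient pixels (illustrative sketch).

    image:     float array of shape (H, W, 3).
    saliency:  float array of shape (H, W); higher means more salient.
    threshold: cutoff on the min-max normalized saliency (assumed default).
    Returns the masked image and the boolean mask of removed pixels.
    """
    image = np.asarray(image, dtype=float)
    saliency = np.asarray(saliency, dtype=float)

    # Normalize saliency to [0, 1]; a constant map yields all zeros.
    span = saliency.max() - saliency.min()
    if span > 0:
        norm = (saliency - saliency.min()) / span
    else:
        norm = np.zeros_like(saliency)

    # Pixels at or above the threshold are treated as salient evidence.
    mask = norm >= threshold

    # Fill masked pixels with the per-channel mean color of the image
    # (assumption about the fill strategy).
    fill = image.mean(axis=(0, 1))
    aux = image.copy()
    aux[mask] = fill
    return aux, mask

# Toy 2x2 image with a single salient pixel in the top-left corner.
img = np.zeros((2, 2, 3))
img[0, 0] = [4.0, 8.0, 12.0]
sal = np.array([[1.0, 0.0], [0.0, 0.0]])
aux_view, removed = mask_salient_regions(img, sal)
```

Because the saliency map does not depend on the text prompt or on the MLLM, the auxiliary view could be computed once per image and cached, which is consistent with the "single cacheable forward pass" described above.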

Paper Structure (27 sections, 13 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Overview of our method. (a) Regular decoding; (b) decoding using the auxiliary view where visual evidence is removed; (c) contrastive decoding.
  • Figure 2: Captions generated by different decoding methods. Hallucinated contents are highlighted in red.
  • Figure 3: Results averaged across three seeds on the hallucination subset of MME with LLaVA-v1.5 (7B).
  • Figure 4: Results averaged across three seeds on the hallucination subset of MME with Qwen-VL (7B).
  • Figure 5: Visualization of generated auxiliary views with different thresholds. Backgrounds are all set to the mean color.
  • ...and 1 more figure