Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding

Boqi Chen; Xudong Liu; Jianing Qiu

TL;DR

The paper tackles object hallucination in Multimodal Large Language Models (MLLMs) by introducing object-aligned auxiliary views for visual contrastive decoding (VCD). It generates the auxiliary view $v'$ by masking the salient regions of the input image $v$, identified from the attention of a self-supervised Vision Transformer (DINO), and integrates this view into VCD with an adaptive plausibility constraint. Given the text input $x$, the contrastive distribution is $p_\text{vcd}(y \mid v, v', x) = \mathrm{softmax}\big((1+\alpha)\,\mathrm{logit}_\theta(y \mid v, x, y_{<t}) - \alpha\,\mathrm{logit}_\theta(y \mid v', x, y_{<t})\big)$, and a token $y_t$ is accepted only if $p_\theta(y_t \mid v, x, y_{<t}) \ge \beta \max_w p_\theta(w \mid v, x, y_{<t})$. The approach is prompt-agnostic and model-agnostic, adds only a single cacheable forward pass of overhead, and yields consistent improvements on the POPE and MME benchmarks across LLaVA-v1.5 and Qwen-VL. By producing semantically meaningful, object-level perturbations, the method strengthens grounding while maintaining fluency. It does, however, rely on the saliency map aligning with the visual evidence, and it may under- or over-mask in cluttered scenes. Overall, this work offers a practical and effective means to reduce object hallucinations in multimodal generation.
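To make the two formulas above concrete, here is a minimal sketch of a single decoding step in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `vcd_next_token_probs`, the default values of `alpha` and `beta`, and the use of plain lists for logits are all hypothetical choices. The two logit vectors are assumed to come from the same model conditioned on the original image and on the auxiliary view, respectively.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats (-inf maps to 0)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def vcd_next_token_probs(logits_orig, logits_aux, alpha=1.0, beta=0.1):
    """Contrastive next-token distribution for one decoding step.

    logits_orig: logits conditioned on the original image v.
    logits_aux:  logits conditioned on the auxiliary view v'.
    alpha, beta: hypothetical defaults, chosen only for illustration.
    """
    # Adaptive plausibility constraint: keep a token only if its probability
    # under the original view is at least beta times the maximum probability.
    p_orig = softmax(logits_orig)
    cutoff = beta * max(p_orig)

    contrast = []
    for lo, la, p in zip(logits_orig, logits_aux, p_orig):
        if p >= cutoff:
            # (1 + alpha) * logit(y | v, ...) - alpha * logit(y | v', ...)
            contrast.append((1 + alpha) * lo - alpha * la)
        else:
            # Rejected tokens get zero probability after the softmax.
            contrast.append(float("-inf"))
    return softmax(contrast)


# Toy vocabulary of three tokens.
probs = vcd_next_token_probs([2.0, 1.0, -5.0], [2.0, 0.0, -5.0])
```

In this toy example the third token falls below the plausibility cutoff and is rejected. The second token's logit drops when the visual evidence is masked, so the contrast raises its score relative to the first token, whose logit does not depend on the view.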

Abstract

We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computational overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
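The construction of the auxiliary view can be sketched as follows. This is a rough illustration and not the paper's procedure: it assumes a per-pixel saliency map in [0, 1] is already available (for example, derived from self-supervised ViT attention, which is not computed here), and the function name `mask_salient_regions`, the `threshold` parameter, and the choice to fill masked pixels with the image's mean color are hypothetical.

```python
def mask_salient_regions(image, saliency, threshold=0.5):
    """Return a copy of `image` with salient pixels replaced by the mean color.

    image:     H x W nested list of (r, g, b) tuples.
    saliency:  H x W nested list of floats in [0, 1] (assumed to be given).
    threshold: hypothetical cutoff; pixels with saliency >= threshold are masked.
    """
    pixels = [px for row in image for px in row]
    n = len(pixels)
    # Mean color over the whole image, used as the fill value (an assumption).
    mean_color = tuple(sum(px[c] for px in pixels) / n for c in range(3))

    return [
        [mean_color if s >= threshold else px for px, s in zip(img_row, sal_row)]
        for img_row, sal_row in zip(image, saliency)
    ]


# Toy 2 x 2 image: the top-left and bottom-right pixels are salient.
image = [[(0, 0, 0), (255, 255, 255)], [(10, 20, 30), (30, 20, 10)]]
saliency = [[0.9, 0.1], [0.2, 0.8]]
aux_view = mask_salient_regions(image, saliency, threshold=0.5)
```

Because the auxiliary view depends only on the image and not on the prompt, it can in principle be computed once per image and reused, which is consistent with the "single cacheable forward pass" described above.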

Paper Structure (27 sections, 13 equations, 6 figures, 7 tables)

Introduction
Related Work
Contrastive decoding for reducing object hallucination.
Method
Visual Contrastive Decoding
Generate Auxiliary Views
Experiments
Settings.
Results.
Ablations
Case study
Conclusion
Appendix
Detailed Experiment Settings
Benchmarks
...and 12 more sections

Figures (6)

Figure 1: Overview of our method. (a) Regular decoding; (b) decoding using the auxiliary view where visual evidence is removed; (c) contrastive decoding.
Figure 2: Captions generated by different decoding methods. Hallucinated contents are highlighted in red.
Figure 3: Results averaged across three seeds on the hallucination subset of MME with LLaVA-v1.5 (7B).
Figure 4: Results averaged across three seeds on the hallucination subset of MME with Qwen-VL (7B).
Figure 5: Visualization of generated auxiliary views with different thresholds. Backgrounds are all set to the mean color.
...and 1 more figure
