MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs

Haonan Ge, Yiwei Wang, Ming-Hsuan Yang, Yujun Cai

TL;DR

MRFD tackles object hallucinations in LVLMs with a training-free decoding approach that draws on multiple regional views of the image. Regions are identified via cross-attention, analyzed independently, and fused using JSD-based reliability weights, with region-aware prompts that mimic a self-consistent Chain-of-Thought process. The method reports state-of-the-art reductions in hallucinations on the POPE, CHAIR, and MME-Hallucination benchmarks with open-source LVLMs, and an ablation study validates the contribution of each component. Because it improves factual grounding without model updates, MRFD offers a scalable, deployment-friendly route to more reliable multimodal reasoning.

Abstract

Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations, that is, text inconsistent with the visual input, because they have limited ability to verify information across different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates an initial response for each region, and computes reliability weights based on the Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
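
To make the JSD-based weighting concrete, here is a minimal sketch in Python. It assumes each regional response is summarized as a probability distribution over a shared vocabulary. The exact weighting rule is not stated in this summary, so the choice below (a softmax over the negative mean pairwise JSD, with a hypothetical `temperature` parameter) is an illustrative assumption, not the authors' formula. The function names `js_divergence` and `reliability_weights` are likewise hypothetical.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # eps keeps log() finite; terms with a == 0 contribute 0.
        return np.sum(a * (np.log(a + eps) - np.log(b + eps)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def reliability_weights(dists, temperature=1.0):
    """Assumed rule: regions that agree with the others get larger weights.

    dists: sequence of K >= 2 probability vectors, one per region.
    Returns K non-negative weights that sum to 1.
    """
    k = len(dists)
    mean_jsd = np.zeros(k)
    for i in range(k):
        others = [js_divergence(dists[i], dists[j]) for j in range(k) if j != i]
        mean_jsd[i] = np.mean(others)
    scores = -mean_jsd / temperature   # lower divergence -> higher score
    scores -= scores.max()             # numerical stability for exp()
    w = np.exp(scores)
    return w / w.sum()

if __name__ == "__main__":
    # Two regions agree; the third is an outlier and should be down-weighted.
    dists = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
    print(reliability_weights(dists))
```

In this toy input the two agreeing regions receive equal weights and the outlier receives the smallest one, which matches the intuition that lower JSD signals a more reliable response.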

Paper Structure

This paper contains 29 sections, 13 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: The MRFD process: leveraging multiple regional responses (Captions 1-4), a JSD-Weighted Block derives consistency weights to guide a prompted fusion decoding, yielding a more reliable output.
  • Figure 2: LVLM cross-attention patterns for "Is there a laptop in the image?". (upper) Full image input results in scattered attention and potential error. (lower) Cropped image input focused on the laptop yields concentrated attention and improved accuracy.
  • Figure 3: Density distribution of JS Divergence for correct versus hallucinated LVLM responses, indicating lower JSD correlates with higher factual accuracy.
  • Figure 4: Overall framework of Multi-Region Fusion Decoding (MRFD): Step 1 uses attention to select and crop salient regions ($v_k$), generates a candidate response ($r_k$) per region, and computes a JSD-based consistency weight ($w_k$) for each response. Step 2 forms a new input per region from the region, its candidate response, and the original prompt. These inputs are processed in parallel, and the per-region logits are fused using the weights $w_k$ during decoding to select the output tokens.
  • Figure 5: Experimental results of MME on a hallucination subset with different decoding strategies.
  • ...and 4 more figures
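
The fusion step described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the per-region logits for one decoding step are given as a `(K, V)` array, the weights $w_k$ are already computed, and fusion is a weighted sum of logits followed by greedy token selection. The caption only says that logits are fused "using the weights $w_k$", so the weighted-sum rule and the names `fuse_logits` and `greedy_step` are hypothetical.

```python
import numpy as np

def fuse_logits(region_logits, weights):
    """Weighted sum of per-region logits (assumed fusion rule).

    region_logits: array-like of shape (K, V), one row per region.
    weights: array-like of shape (K,), e.g. the consistency weights w_k.
    Returns a fused logit vector of shape (V,).
    """
    region_logits = np.asarray(region_logits, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()   # normalize defensively
    return weights @ region_logits

def greedy_step(region_logits, weights):
    """Pick the next token id from the fused logits (greedy decoding)."""
    return int(np.argmax(fuse_logits(region_logits, weights)))

if __name__ == "__main__":
    # Two regions disagree; the more heavily weighted region decides the token.
    logits = [[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0]]
    print(greedy_step(logits, [0.9, 0.1]))
    print(greedy_step(logits, [0.1, 0.9]))
```

In a real decoding loop the chosen token would be appended to every region's input before the next step, so that all regions keep generating the same shared output sequence; that loop is omitted here because it depends on the specific LVLM interface.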