MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs
Haonan Ge, Yiwei Wang, Ming-Hsuan Yang, Yujun Cai
TL;DR
MRFD addresses object hallucination in LVLMs with a training-free decoding approach that draws on multiple image regions. It identifies salient regions via cross-attention, generates a response for each region independently, and fuses the per-region predictions using JSD-based reliability weights, with region-aware prompts inspired by self-consistent Chain-of-Thought reasoning. The method reports state-of-the-art reductions in hallucination on the POPE, CHAIR, and MME-Hallucination benchmarks with open-source LVLMs, and an ablation study examines the contribution of each component. Because it improves factual grounding without model updates, MRFD can be applied to existing LVLMs at inference time.
Abstract
Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations, that is, text inconsistent with the visual input, because they have limited ability to verify information across different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates an initial response for each region, and computes reliability weights based on the Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
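To make the idea of JSD-based reliability weighting concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the exact formula, so several choices here are assumptions for illustration only: each region is represented by a single next-token probability distribution, a region's reliability is taken as a softmax over its negative mean JSD to the other regions, and fusion is a weighted mixture of probabilities. The function names `jsd`, `reliability_weights`, and `fuse`, and the `temperature` parameter, are hypothetical.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def reliability_weights(dists, temperature=1.0):
    """Assumed scheme: regions that disagree with the others get lower weight.

    Each region's score is its negative mean JSD to every other region,
    normalized with a softmax. This is an illustrative choice, not the
    formula from the paper.
    """
    n = len(dists)
    if n == 1:
        return [1.0]
    mean_jsd = [
        sum(jsd(dists[i], dists[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    scores = [-d / temperature for d in mean_jsd]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(dists, weights):
    """Weighted mixture of per-region distributions (assumed fusion rule)."""
    vocab_size = len(dists[0])
    return [sum(w * d[k] for w, d in zip(weights, dists)) for k in range(vocab_size)]


# Toy example: two regions roughly agree, the third is an outlier.
region_dists = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
weights = reliability_weights(region_dists)
fused = fuse(region_dists, weights)
```

In this toy case the outlier region receives the smallest weight, so the fused distribution leans toward the prediction on which the other two regions agree. A lower `temperature` would sharpen that preference; how the actual method sets or avoids such a parameter is not stated in the abstract.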
