Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
Shan Wang, Maying Shen, Nadine Chang, Chuong Nguyen, Hongdong Li, Jose M. Alvarez
TL;DR
This work tackles hallucinations in multimodal LLMs by introducing GACD, an inference-time method that uses a first-order Taylor expansion to estimate token-level biases in the visual and textual inputs. It then mitigates hallucinations via object-aware visual token grouping and anchor-specific influence-weighted decoding, optionally applying sample-dependent early stopping to maintain grounding in long generations. Across open-ended generation and discriminative tasks, GACD reduces hallucinations, improves grounding, and preserves or enhances informativeness, achieving notable gains on benchmarks such as AMBER, POPE, and LLaVA-QA90 while remaining model-agnostic at inference. The approach improves visual grounding without retraining or external models, offering a practical and scalable solution with broad applicability to vision-language tasks and beyond.
Abstract
Multimodal large language models achieve strong performance across diverse tasks but remain prone to hallucinations, where outputs are not grounded in visual inputs. This issue can be attributed to two main biases: text-visual bias, the overreliance on prompts and prior outputs, and co-occurrence bias, spurious correlations between frequently paired objects. We propose Gradient-based Influence-Aware Constrained Decoding (GACD), an inference-based method that addresses both biases without auxiliary models and is readily applicable to existing models without finetuning. The core of our approach is bias estimation, which uses first-order Taylor gradients to estimate the contribution of individual tokens (visual features and text tokens) to the current output. Based on this analysis, GACD mitigates hallucinations through two components: (1) suppressing spurious visual features correlated with the output objects, and (2) rebalancing cross-modal contributions by strengthening visual features relative to text. Experiments across multiple benchmarks demonstrate that GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.
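To make the bias-estimation idea concrete, the sketch below illustrates a first-order Taylor (gradient-times-input) estimate of per-token influence on a single output logit, followed by a comparison of the visual and text shares. It is only an illustration under stated assumptions, not the GACD implementation: the toy scalar model `toy_logit`, the weights `W` and `v`, the helper names, and the split of tokens into visual and text indices are all hypothetical, and the gradient is written out by hand for this toy model rather than taken from a real MLLM.

```python
import math

def _hidden(tokens, W):
    # Sum-pool the token embeddings, then apply a tanh layer (toy stand-in for a model).
    d = len(tokens[0])
    pooled = [sum(t[k] for t in tokens) for k in range(d)]
    return [math.tanh(sum(row[k] * pooled[k] for k in range(d))) for row in W]

def toy_logit(tokens, W, v):
    # Hypothetical scalar "logit" of the current output token.
    return sum(vj * hj for vj, hj in zip(v, _hidden(tokens, W)))

def token_influence(tokens, W, v):
    # First-order Taylor estimate: removing token i changes the logit by about grad . x_i.
    # With sum pooling, every token shares the same gradient with respect to its embedding.
    h = _hidden(tokens, W)
    d = len(tokens[0])
    grad = [sum(v[j] * (1.0 - h[j] ** 2) * W[j][k] for j in range(len(W)))
            for k in range(d)]
    return [sum(grad[k] * t[k] for k in range(d)) for t in tokens]

def modality_share(influence, visual_idx, text_idx):
    # Aggregate absolute influence per modality and normalise to shares.
    vis = sum(abs(influence[i]) for i in visual_idx)
    txt = sum(abs(influence[i]) for i in text_idx)
    total = vis + txt
    return (vis / total, txt / total) if total > 0 else (0.5, 0.5)

# Assumed toy data: tokens 0 and 1 play the role of visual features, token 2 of text.
W = [[0.5, -0.2], [0.1, 0.3]]
v = [1.0, -0.5]
tokens = [[0.02, 0.01], [0.01, -0.03], [0.04, 0.02]]

influence = token_influence(tokens, W, v)
vis_share, txt_share = modality_share(influence, visual_idx=[0, 1], text_idx=[2])
print(influence, vis_share, txt_share)
```

A large text share relative to the visual share would be one way to flag the text-visual bias described above. How GACD actually uses such estimates to reweight decoding is not specified in this abstract and is not shown here.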
