Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation

Jiawei Mao, Hardy Chen, Haoqin Tu, Yuhan Wang, Letian Zhang, Zeyu Zheng, Huaxiu Yao, Zirui Wang, Cihang Xie, Yuyin Zhou

Abstract

Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly limits their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains, weak interpretability, or both. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. First, Kestrel collects explicit visual evidence and converts tool outputs into reusable, structured textual evidence. Second, to take full advantage of this evidence, Kestrel verifies it with an LVLM judge and then iteratively self-refines answers based on the verified evidence, which reduces the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis. For example, both the integrated self-refinement module and the grounding agent contribute an average +2.0% gain on POPE.

Paper Structure (34 sections, 13 figures, 4 tables)

Figures (13)

  • Figure 1: Kestrel progressively corrects hallucinated LVLM answers by integrating an external grounding agent with iterative self-improvement. At each round, the model grounds the current claim with explicit visual and textual evidence, conducts claim-level verification, and conservatively refines the response, yielding a final answer that is both more reliable and more interpretable.
  • Figure 2: Kestrel vs. prior training-free hallucination mitigation methods. By combining an external grounding agent with iterative self-improvement, Kestrel collects explicit visual evidence and further converts tool outputs into structured textual evidence for verification. Compared with prior approaches, this design yields more interpretable and stable evidence, reduces overconfident corrections, and avoids the biased interpretation that may arise when LVLMs rely only on raw visual evidence.
  • Figure 3: Overview of Kestrel. Given an image-question pair, Kestrel follows a training-free four-stage pipeline for LVLM hallucination mitigation: (1) Initialization, which obtains an initial answer and rewrites it into question-aligned verifiable claims with associated visual entities and claim types; (2) Agent Grounding, which invokes an external SAM3-based grounding agent to collect explicit visual evidence (e.g., segmentation overlays, boxes, and crop-and-zoom views) and convert them into structured textual evidence; (3) Claim-level Verification, which verifies each claim against the cited evidence to produce claim-wise verdicts, confidence scores, and a top-level verification decision; and (4) Self-Refinement, which performs evidence-gated answer updating based on the current and previous verification traces.
  • Figure 4: Prediction transition statistics with Qwen3-VL before and after refinement. Predictions are categorized into four types: correctly preserved, error corrected, over-corrected, and incorrectly preserved. The results show that the refinement process is conservative, retaining most originally correct predictions while correcting a portion of erroneous ones, with limited over-correction. Zoom in for a better view.
  • Figure 5: Qualitative results of Kestrel. We compare the VQA responses from the regular baseline and our method based on Qwen3-VL. Zoom in for a better view.
  • ...and 8 more figures
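
The four transition types named in the Figure 4 caption can be tallied from per-sample predictions. The sketch below is illustrative only and is not taken from the paper: it assumes each category is defined by whether the prediction matches the gold label before and after refinement, and the names `transition_type` and `transition_stats` are hypothetical.

```python
from collections import Counter


def transition_type(before_correct: bool, after_correct: bool) -> str:
    """Map a (before, after) correctness pair to a Figure 4 category name.

    Assumed mapping (not confirmed by the source):
      correct -> correct : correctly preserved
      wrong   -> correct : error corrected
      correct -> wrong   : over-corrected
      wrong   -> wrong   : incorrectly preserved
    """
    if before_correct and after_correct:
        return "correctly preserved"
    if after_correct:
        return "error corrected"
    if before_correct:
        return "over-corrected"
    return "incorrectly preserved"


def transition_stats(gold, before, after) -> Counter:
    """Count transition types over aligned lists of labels and predictions."""
    if not (len(gold) == len(before) == len(after)):
        raise ValueError("gold, before, and after must have the same length")
    return Counter(
        transition_type(b == g, a == g)
        for g, b, a in zip(gold, before, after)
    )


if __name__ == "__main__":
    # Toy yes/no answers in the style of a POPE-like benchmark (made-up data).
    gold = ["yes", "no", "yes", "no"]
    before = ["yes", "yes", "yes", "yes"]
    after = ["yes", "no", "no", "yes"]
    for name, count in transition_stats(gold, before, after).items():
        print(f"{name}: {count}")
```

Under this reading, a conservative refinement process is one where the "over-corrected" count stays small relative to "error corrected".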