GroundSight: Augmenting Vision-Language Models with Grounding Information and De-hallucination
Xinxi Chen, Tianyang Chen, Lijia Hong
TL;DR
GroundSight is a retrieval-augmented VQA framework that adds text-grounded region localization, focusing visual search on the objects relevant to the question and thereby reducing background noise and hallucinations. It introduces a four-stage IoU-guided fine-tuning regime that teaches LVLMs to predict bounding boxes from natural-language prompts, and it uses regions of interest (ROIs) from Grounding DINO or other sources to crop images before retrieval. A de-hallucination module conditioned on question type significantly lowers erroneous outputs, and a CLIP threshold of 0.75 helps prune irrelevant search results. Together, these components improve end-to-end performance on a real-world, multi-domain VQA dataset: GroundSight achieves the best truthfulness and the lowest hallucination among the ablations, at the cost of trade-offs in inference time and reproducibility.
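The two numeric criteria mentioned above, IoU between boxes and a 0.75 threshold on retrieval results, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `iou` and `prune_by_similarity` are hypothetical, the box layout `(x_min, y_min, x_max, y_max)` is an assumption, and the scores are assumed to be precomputed CLIP similarities where higher means more relevant.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap, clipped at zero for disjoint boxes.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def prune_by_similarity(
    results: List[Tuple[str, float]], threshold: float = 0.75
) -> List[Tuple[str, float]]:
    """Keep only retrieved items whose score meets the threshold.

    Assumes each result is (item_id, similarity) with the similarity
    already computed elsewhere, for example by a CLIP model.
    """
    return [(item, score) for item, score in results if score >= threshold]
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` gives 1/7, since the boxes overlap in a 1x1 square and cover 7 units of area together. Whether the comparison at the threshold should be inclusive is a design choice the text does not specify; the sketch keeps items scoring exactly 0.75.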
Abstract
We propose a method to improve Visual Question Answering (VQA) with Retrieval-Augmented Generation (RAG) by introducing text-grounded object localization. Rather than retrieving information based on the entire image, our approach enables the model to generate a bounding box around the object most relevant to the question, allowing for targeted image cropping and focused retrieval. This reduces background noise, improves alignment between visual and textual cues, and helps mitigate hallucinations. Our RAG method improves context-aware VQA responses, increasing accuracy from 22.19% to 25.64% (an absolute gain of 3.45 percentage points) over the baseline Llama-3.2-Vision-11B agent. We also propose a de-hallucination method based on question type that reduces the hallucination rate from 65.79% to 13.88% and improves the truthfulness score.
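The targeted cropping step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the helper name `clamp_box`, the optional `pad` margin, and the fallback to the full image for a degenerate box are all hypothetical choices made for this example.

```python
from typing import Tuple


def clamp_box(
    box: Tuple[float, float, float, float],
    width: int,
    height: int,
    pad: int = 0,
) -> Tuple[int, int, int, int]:
    """Turn a predicted box into a valid crop region for an image.

    `box` is assumed to be (x_min, y_min, x_max, y_max) in pixels.
    The result is (left, upper, right, lower), clipped to the image
    bounds, with an optional margin of `pad` pixels on each side.
    """
    x_min, y_min, x_max, y_max = box
    left = max(0, int(x_min) - pad)
    upper = max(0, int(y_min) - pad)
    right = min(width, int(x_max) + pad)
    lower = min(height, int(y_max) + pad)
    # Assumption: if the predicted box is empty or inverted,
    # fall back to the whole image instead of an empty crop.
    if right <= left or lower <= upper:
        return (0, 0, width, height)
    return (left, upper, right, lower)
```

The returned tuple uses the same `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts, so the cropped region could then be passed to retrieval in place of the full image.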
