FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA

Nobin Sarwar

TL;DR

FilterRAG addresses hallucinations in Visual Question Answering by grounding responses in external knowledge sources through a Retrieval-Augmented Generation framework built on BLIP-VQA and a frozen GPT-Neo 1.3B generator. By dividing images into patches, retrieving relevant content from Wikipedia and DBpedia, and integrating this knowledge into answer generation, the approach improves grounding and robustness, particularly in Out-of-Distribution scenarios. The probabilistic formulation $P_{\text{RAG}}(\hat{A}) \approx \prod_{i} \sum_{z \in \text{top-k}(p_\eta(\cdot \mid I, Q))} p_\eta(z \mid I, Q) p_\theta(a_i \mid I, Q, z, a_{1:i-1})$ encapsulates how retrieved knowledge shapes the final answer. Experiments on OK-VQA show the method achieves 36.5% accuracy (ID+OOD) with noticeable reductions in hallucinations, demonstrating practical potential for knowledge-grounded VQA in real-world deployments.
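To make the probabilistic formulation above concrete, the following is a minimal sketch of the per-token marginalization over the top-k retrieved passages. It is not the authors' implementation: the function name `rag_answer_probability` and all numeric values are hypothetical, and the retrieval scores are used as given, without renormalizing over the top-k set.

```python
def rag_answer_probability(retrieval_probs, token_probs_per_doc):
    """Approximate P_RAG(A) as a product over answer tokens of a
    retrieval-weighted sum over the top-k retrieved passages.

    retrieval_probs[j]        : p_eta(z_j | I, Q) for the j-th retrieved passage
    token_probs_per_doc[j][i] : p_theta(a_i | I, Q, z_j, a_{1:i-1})
    (Both inputs are illustrative placeholders, not outputs of a real model.)
    """
    num_tokens = len(token_probs_per_doc[0])
    answer_prob = 1.0
    for i in range(num_tokens):
        # Inner sum: marginalize token i over the retrieved passages.
        marginal = sum(
            p_z * doc_tokens[i]
            for p_z, doc_tokens in zip(retrieval_probs, token_probs_per_doc)
        )
        # Outer product: multiply the per-token marginals together.
        answer_prob *= marginal
    return answer_prob


# Two retrieved passages (k = 2) and a two-token answer, with made-up numbers.
example_retrieval = [0.6, 0.4]
example_tokens = [[0.9, 0.8], [0.5, 0.5]]
example_prob = rag_answer_probability(example_retrieval, example_tokens)
```

With these made-up inputs the per-token marginals are 0.74 and 0.68, so the sketch returns roughly 0.5032. A passage with a higher retrieval score contributes more to each token's probability, which is the sense in which retrieved knowledge shapes the final answer.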

Abstract

Visual Question Answering requires models to generate accurate answers by integrating visual and textual understanding. However, VQA models still struggle with hallucinations, producing convincing but incorrect answers, particularly in knowledge-driven and Out-of-Distribution scenarios. We introduce FilterRAG, a retrieval-augmented framework that combines BLIP-VQA with Retrieval-Augmented Generation to ground answers in external knowledge sources like Wikipedia and DBpedia. FilterRAG achieves 36.5% accuracy on the OK-VQA dataset, demonstrating its effectiveness in reducing hallucinations and improving robustness in both in-domain and Out-of-Distribution settings. These findings highlight the potential of FilterRAG to improve Visual Question Answering systems for real-world deployment.

Paper Structure

This paper contains 26 sections, 11 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Two examples of question-answer pairs from the OK-VQA dataset. The left example asks about the items on a hot dog, requiring models to incorporate external knowledge of common food items. The right example asks about the sport associated with a motorcycle, emphasizing the need to understand how people typically use such vehicles. These examples illustrate the fundamental challenge of OK-VQA, where models rely on external knowledge to generate accurate answers rather than depending solely on the image.
  • Figure 2: The FilterRAG architecture: A step-by-step process integrating frozen BLIP-VQA with Retrieval-Augmented Generation (RAG). The system retrieves knowledge from Wikipedia and DBpedia, augments image-question pairs, and uses frozen GPT-Neo 1.3B to generate answers.
  • Figure 3: Comparison of Model Accuracy Across Different Settings.
  • Figure 4: Grounding Score Comparison Across Baselines and Proposed Methods.
  • Figure 5: Effect of Grid Sizes on Accuracy and Grounding Score.
  • ...and 1 more figure
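The TL;DR describes dividing images into patches, and Figure 5 varies the grid size. As a rough illustration of what such a split could look like, here is a minimal sketch that computes the pixel boxes of a rows × cols grid. The function name `grid_patch_boxes` and the `(left, top, right, bottom)` box convention are assumptions made for illustration; the source does not specify how FilterRAG computes its patches.

```python
def grid_patch_boxes(width, height, rows, cols):
    """Return (left, top, right, bottom) pixel boxes that tile an image
    of the given size into a rows x cols grid, in row-major order.

    Integer division keeps the boxes contiguous and non-overlapping even
    when the image size is not an exact multiple of the grid size.
    """
    boxes = []
    for r in range(rows):
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            boxes.append((left, top, right, bottom))
    return boxes


# Hypothetical 640x480 image split into a 2x2 grid.
example_boxes = grid_patch_boxes(640, 480, 2, 2)
```

For the hypothetical 640×480 image, the 2×2 grid yields four 320×240 boxes. Each box could then be cropped and passed to the vision model separately; a finer grid gives more, smaller patches.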