Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines

Xinwei Long, Zhiyuan Ma, Ermo Hua, Kaiyan Zhang, Biqing Qi, Bowen Zhou

TL;DR

ReAuSE addresses knowledge-intensive visual question answering by unifying retrieval and generation inside a single generative multi-modal LLM. It introduces a built-in autoregressive search engine that retrieves documents by generating their identifiers, and it calibrates that retriever from relevance feedback using Direct Preference Optimization so that retrieval aligns with answer quality. On OKVQA and A-OKVQA, ReAuSE improves over strong baselines by 2.9% to 9.6% across all evaluation metrics, with robust retrieval performance across large knowledge bases and efficient, end-to-end inference. By integrating external knowledge directly into answer generation through generative retrieval, the approach reduces reliance on separate discriminative retrievers and leaves room for domain-specific extension.

Abstract

Retrieval-augmented generation (RAG) has emerged to address the knowledge-intensive visual question answering (VQA) task. Current methods mainly employ separate retrieval and generation modules to acquire external knowledge and generate answers, respectively. We propose ReAuSE, an alternative to previous RAG models for the knowledge-based VQA task, which seamlessly integrates a knowledge retriever into the generative multi-modal large language model, where it serves as a built-in search engine. Specifically, our model functions both as a generative retriever and as an accurate answer generator. It not only retrieves documents from the knowledge base by producing an identifier for each document, but also answers visual questions based on the retrieved documents. Furthermore, we propose a reinforced retrieval calibration module that learns from relevance feedback to improve retrieval performance and to align retrieval with the preferences for accurate answer generation. Extensive experiments on two representative datasets, OKVQA and A-OKVQA, demonstrate significant improvements ranging from 2.9% to 9.6% across all evaluation metrics when compared to strong baselines.
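To make the idea of "retrieving by producing identifiers" concrete, here is a minimal sketch of constrained identifier generation. It is an illustration under stated assumptions, not the authors' implementation: the prefix tree, the `END` marker, the toy identifiers, and the `toy_scorer` function (a stand-in for a model's next-token scores) are all hypothetical, and the excerpt does not say how ReAuSE actually constrains decoding or defines identifiers.

```python
from typing import Callable, Dict, List, Sequence, Tuple

END = "<eos>"  # hypothetical end-of-identifier marker


def build_trie(identifiers: Sequence[Sequence[str]]) -> dict:
    """Index every valid identifier (a token sequence) in a prefix tree."""
    root: dict = {}
    for tokens in identifiers:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root


def constrained_greedy_decode(
    trie: dict, score_next: Callable[[List[str], str], float]
) -> List[str]:
    """Greedily generate one identifier, choosing only tokens the trie allows."""
    if not trie:
        raise ValueError("the identifier set is empty")
    prefix: List[str] = []
    node = trie
    while True:
        allowed = sorted(node)  # tokens that keep the prefix inside the identifier set
        best = max(allowed, key=lambda tok: score_next(prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)
        node = node[best]


def toy_scorer(query_words: set) -> Callable[[List[str], str], float]:
    """Stand-in for model scores: prefer tokens that also occur in the query."""
    def score(prefix: List[str], tok: str) -> float:
        return 1.0 if tok in query_words else 0.0
    return score


# Toy knowledge base: identifier token sequence -> document key (all hypothetical).
doc_ids: Dict[Tuple[str, ...], str] = {
    ("golden", "gate", "bridge"): "doc_1",
    ("golden", "retriever", "dog"): "doc_2",
    ("eiffel", "tower"): "doc_3",
}
trie = build_trie(list(doc_ids))

identifier = constrained_greedy_decode(trie, toy_scorer({"golden", "retriever", "dog"}))
retrieved = doc_ids[tuple(identifier)]
```

Because every decoding step is restricted to tokens that extend a known identifier, the generated sequence always maps back to a document in the knowledge base; the retrieved document can then be passed to the same model as context for answer generation.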

Paper Structure

This paper contains 16 sections, 7 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Comparison with the paradigm of previous knowledge-based VQA methods.
  • Figure 2: The architecture of ReAuSE. ReAuSE contains three components: Built-in Autoregressive Search Engine for knowledge retrieval, Reinforced Retrieval Calibration via Relevance Feedback to align retrievers with relevance preferences, and Retrieval-Augmented Generation for answer prediction.
  • Figure 3: Case Studies. The highlighted text represents the document identifier generated by our model.
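
The TL;DR names Direct Preference Optimization (DPO) as the mechanism behind the reinforced retrieval calibration component shown in Figure 2. As a rough illustration, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities. Reading "chosen" as an identifier that led to a relevant document and "rejected" as one that did not is an assumption made here for illustration; this excerpt does not show how the paper builds its preference pairs, and the value of `beta` is likewise hypothetical.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical temperature; not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of generated sequences."""
    # How much more the policy likes each sequence than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow for large |x|.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# When the policy matches the reference, the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# When the policy favors the chosen sequence more than the reference does, the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Minimizing this loss pushes the model to assign relatively higher probability to the preferred identifier than the reference model does, which is one way a retriever can be aligned with relevance preferences without training a separate reward model.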