Multimodal Reranking for Knowledge-Intensive Visual Question Answering

Haoyang Wen, Honglei Zhuang, Hamed Zamani, Alexander Hauptmann, Michael Bendersky

TL;DR

The paper tackles knowledge-intensive visual question answering (KI-VQA) by adding a multi-modal reranker to the standard retrieve-then-generate pipeline. The reranker performs cross-item interaction between the question and each knowledge candidate. Built on a Wikipedia-based image-text dataset and a vision-language Transformer framework, and guided by distant supervision, it refines candidate relevance before answer reasoning. Experiments on OK-VQA and A-OKVQA show consistent improvements over retrieval-only baselines and reveal a training-testing discrepancy: answer generation performs better when the training candidates are similar to or noisier than those seen at test time. The work also highlights the upper-bound potential of reranking via an oracle ranking, and points to future directions in memory-efficient multi-modal reasoning and broader applicability to vision-language tasks.

Abstract

Knowledge-intensive visual question answering requires models to effectively use external knowledge to help answer visual questions. A typical pipeline includes a knowledge retriever and an answer generator. However, a retriever that utilizes local information, such as an image patch, may not provide reliable question-candidate relevance scores. In addition, the two-tower architecture limits how well a retriever can model relevance scores when selecting top candidates for the answer generator to reason over. In this paper, we introduce an additional module, a multi-modal reranker, to improve the ranking quality of knowledge candidates for answer generation. Our reranking module takes multi-modal information from both candidates and questions and performs cross-item interaction for better relevance score modeling. Experiments on OK-VQA and A-OKVQA show that a multi-modal reranker trained from distant supervision provides consistent improvements. We also find a training-testing discrepancy with reranking in answer generation, where performance improves if training knowledge candidates are similar to or noisier than those used in testing.
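
The contrast between two-tower retrieval and cross-item reranking can be illustrated with a minimal sketch. This is not the paper's implementation: the embeddings, the candidate texts, and the `token_overlap` scorer are hypothetical toy stand-ins (the paper's reranker is a vision-language Transformer over multi-modal inputs). The sketch only shows the control flow: a retriever scores independently encoded vectors with a dot product, then a reranker rescores each question-candidate pair jointly and keeps the top few for the answer generator.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Vector, cand_vecs: List[Vector], k: int) -> List[int]:
    """Two-tower retrieval: query and candidates are encoded separately,
    so relevance is only a dot product between precomputed vectors."""
    scores = [dot(query_vec, c) for c in cand_vecs]
    order = sorted(range(len(cand_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def rerank(
    question: str,
    candidates: List[str],
    retrieved: List[int],
    joint_score: Callable[[str, str], float],
    n: int,
) -> List[int]:
    """Reranking: each (question, candidate) pair is scored jointly,
    so the scorer can model interactions between the two items."""
    order = sorted(
        retrieved,
        key=lambda i: joint_score(question, candidates[i]),
        reverse=True,
    )
    return order[:n]


def token_overlap(question: str, candidate: str) -> float:
    """Hypothetical toy joint scorer: number of shared tokens.
    A real reranker would be a cross-attention model over text and image."""
    return float(len(set(question.split()) & set(candidate.split())))


# Toy data (made up for illustration, loosely following the Figure 1 example).
question = "which city is famous for deep dish pizza"
candidates = [
    "pizza is a dish of italian origin",
    "chicago is a city famous for deep dish pizza",
    "a dish is a plate for food",
]
query_vec = [1.0, 0.0]                             # assumed query embedding
cand_vecs = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.0]]   # assumed candidate embeddings

top_k = retrieve(query_vec, cand_vecs, k=3)
top_n = rerank(question, candidates, top_k, token_overlap, n=2)
print(top_k, top_n)
```

In this toy setup the retriever ranks the Chicago passage last because its assumed embedding has the lowest dot product, while the joint scorer moves it to the top. The design point is that reranking only needs to run on the small retrieved set, so a more expensive pairwise scorer becomes affordable.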

Paper Structure (19 sections, 7 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: An example from OK-VQA, which requires knowledge to associate deep-dish pizza and Chicago.
  • Figure 2: A basic KI-VQA framework, which first retrieves the top relevant knowledge candidates using the visual question and then combines the question with the retrieved candidates to generate the answer. The dashed box marks our reranking module, described in the reranking section.
  • Figure 3: Framework of multimodal reranking.