EchoSight: Advancing Visual-Language Models with Wiki Knowledge

Yibin Yan, Weidi Xie

TL;DR

EchoSight addresses the challenge of knowledge-based visual question answering by integrating large-scale encyclopedic content through a two-stage retrieval-augmented generation pipeline. It first performs a visual-only retrieval to narrow the knowledge base, then applies a multimodal reranking that leverages both image and text to select the most relevant article sections before generating an answer with an LLM. The approach yields state-of-the-art results on Encyclopedic VQA and InfoSeek, with notable gains in retrieval recall and final VQA accuracy, demonstrating the value of aligning multimodal content with visual context. The work highlights a practical pathway to incorporate precise external knowledge into vision-language systems, albeit with higher computational costs and dependence on knowledge base coverage.

Abstract

Knowledge-based Visual Question Answering (KVQA) tasks require answering questions about images using extensive background knowledge. Despite significant advancements, generative models often struggle with these tasks due to the limited integration of external knowledge. In this paper, we introduce EchoSight, a novel multimodal Retrieval-Augmented Generation (RAG) framework that enables large language models (LLMs) to answer visual questions requiring fine-grained encyclopedic knowledge. To achieve high-performing retrieval, EchoSight first searches wiki articles using visual-only information; these candidate articles are then reranked according to their relevance to the combined text-image query. This approach significantly improves the integration of multimodal knowledge, leading to enhanced retrieval outcomes and more accurate VQA responses. Our experimental results on the Encyclopedic VQA and InfoSeek datasets demonstrate that EchoSight establishes new state-of-the-art results in knowledge-based VQA, achieving an accuracy of 41.8% on Encyclopedic VQA and 31.3% on InfoSeek.

Paper Structure

This paper contains 23 sections, 5 equations, 3 figures, 12 tables.

Figures (3)

  • Figure 1: For visual questions such as “When was the 1st ascent of this mountain?”, visual-only search methods consider image similarity only, ignoring the textual details of the accompanying article. By incorporating multimodal reranking, the correct entry, accounting for both visual and textual information, can be accurately identified.
  • Figure 2: Overview of the proposed EchoSight. (i) Given a visual question with an image, the retriever searches the knowledge base for the top $k$ images most similar to the reference image and collects their corresponding Wikipedia entries. (ii) After changing the granularity to sections, all sections of the retrieved entries are reranked by the maximum pairwise similarity between their textual embeddings and the Q-Former query tokens of the reference image and question. (iii) The top-ranked section is used as the RAG prompt for the LLM to generate the final answer.
  • Figure 3: Qualitative VQA results on Encyclopedic VQA compared with GPT-4V. The first row shows results for landmarks and the second row for natural species. Some failure cases are shown in the third row, together with the ground truth.
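
The two stages described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`retrieve_entries`, `section_score`, `rerank_sections`) are hypothetical, the vectors are toy stand-ins for real encoder and Q-Former outputs, cosine similarity is an assumed choice of similarity measure, and each section is assumed to have a single text embedding.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve_entries(query_image_vec, kb_images, k):
    """Stage 1 (visual-only): return the entries behind the top-k similar images.

    kb_images is a list of (entry_id, image_vec) pairs; the layout is an
    assumption made for this sketch.
    """
    q = normalize(query_image_vec)
    ranked = sorted(kb_images, key=lambda item: dot(q, normalize(item[1])), reverse=True)
    entries = []
    for entry_id, _ in ranked[:k]:
        if entry_id not in entries:  # several images may map to one entry
            entries.append(entry_id)
    return entries


def section_score(query_tokens, section_vec):
    """Maximum pairwise cosine similarity between query tokens and a section."""
    s = normalize(section_vec)
    return max(dot(normalize(t), s) for t in query_tokens)


def rerank_sections(query_tokens, sections_by_entry, entry_ids):
    """Stage 2 (multimodal): rank every section of the retrieved entries."""
    candidates = [
        (entry_id, idx, vec)
        for entry_id in entry_ids
        for idx, vec in enumerate(sections_by_entry[entry_id])
    ]
    return sorted(candidates, key=lambda c: section_score(query_tokens, c[2]), reverse=True)


# Toy data: image vectors are 2-D, text and query-token vectors are 3-D.
kb_images = [("mountain_a", [1.0, 0.0]), ("mountain_b", [0.9, 0.1]), ("bird_c", [0.0, 1.0])]
sections_by_entry = {
    "mountain_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "mountain_b": [[0.0, 0.0, 1.0]],
    "bird_c": [[0.5, 0.5, 0.0]],
}
query_tokens = [[0.0, 0.1, 1.0], [0.2, 0.0, 0.9]]  # stand-in for Q-Former query tokens

entries = retrieve_entries([1.0, 0.05], kb_images, k=2)
best_entry, best_section, _ = rerank_sections(query_tokens, sections_by_entry, entries)[0]
print(entries, best_entry, best_section)
```

In this toy setup the visually closest entry is `mountain_a`, yet the reranker places a section of `mountain_b` first because its text embedding matches the query tokens best. That mirrors the situation in Figure 1, where image similarity alone does not identify the correct article. The top-ranked section would then be passed to the LLM as context.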