EchoSight: Advancing Visual-Language Models with Wiki Knowledge
Yibin Yan, Weidi Xie
TL;DR
EchoSight addresses the challenge of knowledge-based visual question answering by integrating large-scale encyclopedic content through a two-stage retrieval-augmented generation pipeline. It first performs a visual-only retrieval to narrow the knowledge base, then applies a multimodal reranking that leverages both image and text to select the most relevant article sections before generating an answer with an LLM. The approach yields state-of-the-art results on Encyclopedic VQA and InfoSeek, with notable gains in retrieval recall and final VQA accuracy, demonstrating the value of aligning multimodal content with visual context. The work highlights a practical pathway to incorporate precise external knowledge into vision-language systems, albeit with higher computational costs and dependence on knowledge base coverage.
Abstract
Knowledge-based Visual Question Answering (KVQA) tasks require answering questions about images using extensive background knowledge. Despite significant advancements, generative models often struggle with these tasks due to the limited integration of external knowledge. In this paper, we introduce EchoSight, a novel multimodal Retrieval-Augmented Generation (RAG) framework that enables large language models (LLMs) to answer visual questions requiring fine-grained encyclopedic knowledge. To achieve high-performing retrieval, EchoSight first searches wiki articles using visual-only information; these candidate articles are then reranked according to their relevance to the combined text-image query. This approach significantly improves the integration of multimodal knowledge, leading to better retrieval outcomes and more accurate VQA responses. Our experimental results on the Encyclopedic VQA and InfoSeek datasets demonstrate that EchoSight establishes new state-of-the-art results in knowledge-based VQA, achieving an accuracy of 41.8% on Encyclopedic VQA and 31.3% on InfoSeek.
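
The two-stage retrieval described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Article` and `Section` types, the toy two-dimensional embeddings, and the weighted-sum reranking score (with a hypothetical `alpha` weight) are all assumptions made for illustration. Stage 1 ranks articles by image similarity alone; stage 2 rescores the sections of the surviving articles using both the image and the question.

```python
import math
from dataclasses import dataclass


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Section:            # hypothetical container for one article section
    text: str
    embedding: list       # toy text embedding


@dataclass
class Article:            # hypothetical container for one wiki article
    title: str
    image_embedding: list # toy embedding of the article's image
    sections: list


def retrieve_by_image(query_image_emb, articles, k):
    """Stage 1: visual-only retrieval, keeping the top-k articles."""
    ranked = sorted(
        articles,
        key=lambda a: cosine(query_image_emb, a.image_embedding),
        reverse=True,
    )
    return ranked[:k]


def rerank_sections(query_image_emb, question_emb, candidates, alpha=0.5):
    """Stage 2: score each candidate section with image and text together.

    The weighted sum is an illustrative stand-in for a learned multimodal
    reranker, not the scoring function used in the paper.
    """
    scored = []
    for article in candidates:
        visual = cosine(query_image_emb, article.image_embedding)
        for section in article.sections:
            textual = cosine(question_emb, section.embedding)
            score = alpha * visual + (1 - alpha) * textual
            scored.append((score, article.title, section.text))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Toy knowledge base; all vectors are made up for the example.
knowledge_base = [
    Article("Eiffel Tower", [1.0, 0.0],
            [Section("History of the tower", [1.0, 0.0]),
             Section("Height of the tower", [0.0, 1.0])]),
    Article("Tokyo Tower", [0.9, 0.1],
            [Section("Height of Tokyo Tower", [0.1, 0.9])]),
    Article("Golden Gate Bridge", [0.0, 1.0],
            [Section("Length of the bridge", [0.0, 1.0])]),
]

query_image = [1.0, 0.0]   # the photo resembles the Eiffel Tower
question = [0.0, 1.0]      # the question is about height

candidates = retrieve_by_image(query_image, knowledge_base, k=2)
ranking = rerank_sections(query_image, question, candidates)
best_score, best_title, best_text = ranking[0]
```

In a full pipeline, the top-ranked section text would be passed to an LLM together with the question to generate the answer. Separating the stages keeps the costlier section-level scoring confined to a small candidate set.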
