SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM

Jielin Qiu; Andrea Madotto; Zhaojiang Lin; Paul A. Crook; Yifan Ethan Xu; Xin Luna Dong; Christos Faloutsos; Lei Li; Babak Damavandi; Seungwhan Moon

TL;DR

This paper introduces SnapNTell, an entity-centric visual question answering benchmark designed to probe accurate recognition and knowledge-rich responses for long-tail real-world entities. It presents a retrieval-augmented multimodal LLM that integrates semantic region extraction, image-based entity recognition, and multi-source knowledge retrieval to generate grounded, knowledge-intensive answers. The SnapNTell dataset comprises 22 categories with 7,568 entities, 10 images per entity, and 10 knowledge-driven QA pairs per entity, featuring carefully curated quality and anonymity to challenge current models. Empirical results show substantial improvements over baselines across multiple metrics, with notable gains for tail entities and strong human evaluation alignment, highlighting the method’s potential to reduce hallucinations and improve factuality in VQA. The work emphasizes the practical significance of combining retrieval with multimodal reasoning to better handle real-world, entity-specific knowledge tasks.

Abstract

Vision-extended LLMs (VLLMs) have made significant strides in Visual Question Answering (VQA). Despite these advancements, VLLMs still encounter substantial difficulties in handling queries involving long-tail entities, with a tendency to produce erroneous or hallucinated responses. In this work, we introduce a novel evaluative benchmark named SnapNTell, specifically tailored for entity-centric VQA. This task aims to test the models' capabilities in identifying entities and providing detailed, entity-specific knowledge. We have developed the SnapNTell Dataset, distinct from traditional VQA datasets: (1) It encompasses a wide range of categorized entities, each represented by images and explicitly named in the answers; (2) It features QA pairs that require extensive knowledge for accurate responses. The dataset is organized into 22 major categories, containing 7,568 unique entities in total. For each entity, we curated 10 illustrative images and crafted 10 knowledge-intensive QA pairs. To address this novel task, we devised a scalable, efficient, and transparent retrieval-augmented multimodal LLM. Our approach markedly outperforms existing methods on the SnapNTell dataset, achieving a 66.5% improvement in the BLEURT score. We will soon make the dataset and the source code publicly accessible.
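The dataset size implied by the abstract's per-entity counts can be worked out with a short sketch. It assumes every entity has exactly 10 images and 10 QA pairs, as stated above; the helper name `dataset_totals` is hypothetical and not part of any released code, and the per-category figure is only an average, not a claim that entities are evenly distributed.

```python
# Counts taken directly from the abstract.
NUM_CATEGORIES = 22
NUM_ENTITIES = 7_568
IMAGES_PER_ENTITY = 10
QA_PAIRS_PER_ENTITY = 10


def dataset_totals(num_entities, images_per_entity, qa_pairs_per_entity):
    """Hypothetical helper: totals implied by uniform per-entity counts."""
    return {
        "images": num_entities * images_per_entity,
        "qa_pairs": num_entities * qa_pairs_per_entity,
    }


totals = dataset_totals(NUM_ENTITIES, IMAGES_PER_ENTITY, QA_PAIRS_PER_ENTITY)
# Average only; the abstract does not say how entities are spread over categories.
avg_entities_per_category = NUM_ENTITIES / NUM_CATEGORIES

print(totals)                      # {'images': 75680, 'qa_pairs': 75680}
print(avg_entities_per_category)   # 344.0
```

Under these assumptions the benchmark holds 75,680 images and 75,680 QA pairs, with an average of 344 entities per category.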

Paper Structure (50 sections, 4 equations, 11 figures, 11 tables)

Introduction
Related Works
Knowledge-based VQA
Multimodal LLMs
Retrieval-augmented LLM
Open-domain visual entity recognition
SnapNTell Dataset
Entity Categorization
Image collection
Filtering
Knowledge-intensive Question-Answer Pairs
Quality and consistency
Statistics and Analysis of Our Dataset
Entity statistics
Popularity
...and 35 more sections

Figures (11)

Figure 1: Comparing SnapNTell with existing methods reveals a distinctive focus. In the SnapNTell benchmark, the answers are predominantly entity-centric, characterized by a greater depth of knowledgeable information pertaining to the specific entity depicted in the image as the answer.
Figure 2: Comparison with existing datasets, where previous VQA datasets mostly focus on freeform answers (such as yes/no for verification questions and choice for selection questions).
Figure 3: Our SnapNTell model architecture takes an image-question pair as input. It begins with retrieval augmentation to source relevant information about the entity in the image. This information, along with the question, feeds into the word embedding layer. Text embeddings merge with image-projected embeddings before entering the LLM, culminating in a knowledgeable answer as the output.
Figure 4: Human evaluation results on pairwise comparisons (% win, tie, lose) with baseline outputs against the manually annotated ground-truth from SnapNTell.
Figure 5: The pertinent information collected from Wikipedia for each entity during dataset building, which includes a general introductory summary, toponym, location information, and so on.
...and 6 more figures
