VISA: Retrieval Augmented Generation with Visual Source Attribution

Xueguang Ma; Shengyao Zhuang; Bevan Koopman; Guido Zuccon; Wenhu Chen; Jimmy Lin

TL;DR

VISA addresses verifiability in retrieval-augmented generation by using vision-language models to attribute generated answers to specific regions of document screenshots. It introduces two benchmarks, Wiki-VISA and Paper-VISA, for training and evaluating answer generation together with bounding-box localization of the supporting evidence across diverse document types. Fine-tuning on these datasets, with augmentation from additional data sources, significantly improves bounding-box accuracy and answer grounding, while exposing gaps in zero-shot generalization and in complex multi-document scenarios. The work demonstrates a practical, end-to-end approach to verifiable RAG and lays groundwork for broader applicability and further refinement of visual evidence localization.

Abstract

Generation with source attribution is important for enhancing the verifiability of retrieval-augmented generation (RAG) systems. However, existing RAG approaches primarily link generated content to document-level references, making it difficult for users to locate evidence among multiple content-rich retrieved documents. To address this challenge, we propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution. Leveraging large vision-language models (VLMs), VISA identifies the evidence and uses bounding boxes to highlight the exact regions of the retrieved document screenshots that support the generated answer. To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the medical domain. Experimental results demonstrate the effectiveness of VISA for visual source attribution on documents in their original appearance and highlight the challenges that remain. Code, data, and model checkpoints will be released.
