VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation

Yubo Sun, Chunyi Peng, Yukun Yan, Shi Yu, Zhenghao Liu, Chi Chen, Zhiyuan Liu, Maosong Sun

TL;DR

The paper tackles unreliable cross-image grounding in visual retrieval-augmented generation (VRAG) by introducing EVisRAG, which first observes retrieved images, records per-image evidence, and then reasons over the aggregated cues to ground answers. Training uses Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which assigns fine-grained rewards to scope-specific tokens, aligning perception and reasoning in a unified objective. Empirical results across five VQA benchmarks show substantial end-to-end gains over backbone vision-language models and competitive VRAG/VLRM baselines, with notable improvements in evidence localization and reduced hallucinations. The approach demonstrates robust multi-image grounding and practical efficiency, marking a significant advance for reliable visual retrieval-augmented generation systems.

Abstract

Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. To address this issue, we propose EVisRAG, an end-to-end framework that learns evidence-guided multi-image reasoning. The model first observes the retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize the visual perception and reasoning abilities of VLMs. Experimental results on multiple visual question answering benchmarks demonstrate that EVisRAG delivers substantial end-to-end gains over the backbone VLM, with a 27% improvement on average. Further analysis shows that, powered by RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images and deriving the final answer from that evidence, much like a real detective.

Paper Structure

This paper contains 24 sections, 13 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Comparison of normal vision-language reasoning model (VLRM) and EVisRAG
  • Figure 2: Overall framework of EVisRAG. Given the query and the top-3 retrieved document pages, EVisRAG outputs four token scopes: observe, record evidence, reason, and answer. RS-GRPO assigns three fine-grained rewards to scope-specific tokens. In-scope rewards are then averaged and group-normalized to obtain token advantages for policy updates.
  • Figure 3: Comparison of models’ attention to question-relevant visual evidence. (a) Accuracy vs. attention ratio within human-annotated boxes; EVisRAG achieves the highest. (b) Qualitative maps: Compared with the baseline, EVisRAG better focuses on the top bar encoding the evidence.
  • Figure 4: Performance comparison at different visual evidence densities. Despite increasing noise as more images are retrieved, EVisRAG's performance remains stable.
  • Figure 5: Model performance comparisons in different retrieval scenarios on ChartQA. Compared with the backbone, EVisRAG remains more faithful to the retrieved content in both correct and incorrect retrieval scenarios.
  • ...and 11 more figures
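
The Figure 2 caption describes the core of RS-GRPO: rewards are bound to token scopes, the rewards that apply to a scope are averaged, and the result is normalized across the sampled group to give per-token advantages. The sketch below is one possible reading of that caption, not the authors' implementation. The reward names (`evidence_reward`, `format_reward`, `answer_reward`), the reward-to-scope mapping in `REWARD_SCOPES`, and the mean/standard-deviation normalization are all assumptions made for illustration; the caption only states that there are three rewards and four scopes.

```python
from statistics import mean, pstdev

# The four token scopes named in the Figure 2 caption.
SCOPES = ("observe", "evidence", "reason", "answer")

# HYPOTHETICAL mapping from each reward to the scopes it is bound to.
# The source does not specify which reward covers which scope.
REWARD_SCOPES = {
    "evidence_reward": {"observe", "evidence"},
    "format_reward": {"observe", "evidence", "reason", "answer"},
    "answer_reward": {"reason", "answer"},
}


def scope_rewards(rewards):
    """Average, per scope, the rewards bound to that scope (one response)."""
    return {
        scope: mean(v for name, v in rewards.items() if scope in REWARD_SCOPES[name])
        for scope in SCOPES
    }


def scope_advantages(group_rewards, eps=1e-6):
    """Normalize each scope's averaged reward across the sampled group.

    Assumes GRPO-style normalization: subtract the group mean and divide by
    the group standard deviation (plus eps to avoid division by zero).
    """
    per_response = [scope_rewards(r) for r in group_rewards]
    advantages = [{} for _ in per_response]
    for scope in SCOPES:
        values = [p[scope] for p in per_response]
        mu, sigma = mean(values), pstdev(values)
        for adv, value in zip(advantages, values):
            adv[scope] = (value - mu) / (sigma + eps)
    return advantages


def token_advantages(token_scopes, advantage_by_scope):
    """Give every token the advantage of the scope it belongs to."""
    return [advantage_by_scope[scope] for scope in token_scopes]


# Two sampled responses to the same query (made-up reward values).
group = [
    {"evidence_reward": 1.0, "format_reward": 1.0, "answer_reward": 1.0},
    {"evidence_reward": 0.0, "format_reward": 1.0, "answer_reward": 0.0},
]
advs = scope_advantages(group)
tokens = token_advantages(["observe", "evidence", "reason", "answer"], advs[0])
```

Under these assumptions, a response whose evidence and answer are both rewarded receives positive advantages in every scope, while a response can in principle receive a positive advantage on its evidence tokens and a negative one on its answer tokens, which is the kind of scope-specific credit the caption describes.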