VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph

Qiuchen Wang, Shihang Wang, Yu Zeng, Qiang Zhang, Fanrui Zhang, Zhuoning Guo, Bosi Zhang, Wenxuan Huang, Lin Chen, Zehui Chen, Pengjun Xie, Ruixue Ding

TL;DR

This paper introduces VimRAG, a framework for multimodal retrieval-augmented reasoning across text, images, and videos. Its training strategy disentangles step-wise validity from trajectory-level rewards by pruning memory nodes associated with redundant actions, which enables fine-grained credit assignment.

Abstract

Effectively retrieving, reasoning over, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrieval-Augmented Generation (RAG) methods rely on linear interaction histories, which struggle to handle long-context tasks, especially those involving information-sparse yet token-heavy visual data in iterative reasoning scenarios. To bridge this gap, we introduce VimRAG, a framework tailored for multimodal retrieval-augmented reasoning across text, images, and videos. Inspired by our systematic study, we model the reasoning process as a dynamic directed acyclic graph that structures the agent states and retrieved multimodal evidence. Building upon this structured memory, we introduce a Graph-Modulated Visual Memory Encoding mechanism, which evaluates the significance of memory nodes via their topological position, allowing the model to dynamically allocate high-resolution tokens to pivotal evidence while compressing or discarding trivial clues. To train the agent under this paradigm, we propose a Graph-Guided Policy Optimization strategy. This strategy disentangles step-wise validity from trajectory-level rewards by pruning memory nodes associated with redundant actions, thereby facilitating fine-grained credit assignment. Extensive experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks. The code is available at https://github.com/Alibaba-NLP/VRAG.
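
To make the idea of a graph-structured memory with topology-dependent token allocation more concrete, here is a minimal Python sketch. It is not the authors' implementation: the class names (`MemoryNode`, `MemoryGraph`), the function `allocate_tokens`, the scoring rule (semantic relevance times descendant count times an exponential recency decay), and the `decay` and `min_score` parameters are all assumptions chosen for illustration. The abstract only states that node significance depends on topological position and that trivial clues are compressed or discarded.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One reasoning step: the action taken plus bookkeeping (hypothetical layout)."""
    node_id: int
    action: str
    parents: list = field(default_factory=list)
    step: int = 0            # creation time of the node
    relevance: float = 0.0   # semantic relevance in [0, 1], assumed to be given


class MemoryGraph:
    """A DAG of reasoning steps; edges point from earlier steps to later ones."""

    def __init__(self):
        self.nodes = {}

    def add_node(self, action, parents=(), relevance=0.0):
        # Parents must already exist, so the graph is acyclic by construction.
        for p in parents:
            if p not in self.nodes:
                raise KeyError(f"unknown parent node {p}")
        node_id = len(self.nodes)
        self.nodes[node_id] = MemoryNode(
            node_id, action, list(parents), step=node_id, relevance=relevance
        )
        return node_id

    def descendant_counts(self):
        """Number of later steps that build (directly or indirectly) on each node."""
        children = {i: [] for i in self.nodes}
        for node in self.nodes.values():
            for p in node.parents:
                children[p].append(node.node_id)
        counts = {}
        for i in self.nodes:
            seen, stack = set(), list(children[i])
            while stack:
                c = stack.pop()
                if c not in seen:
                    seen.add(c)
                    stack.extend(children[c])
            counts[i] = len(seen)
        return counts


def allocate_tokens(graph, total_budget, decay=0.8, min_score=0.05):
    """Split a vision-token budget across nodes (assumed scoring rule)."""
    current = max(n.step for n in graph.nodes.values())
    desc = graph.descendant_counts()
    scores = {}
    for i, n in graph.nodes.items():
        topo = 1 + desc[i]                      # topological importance
        recency = decay ** (current - n.step)   # temporal forgetting
        score = n.relevance * topo * recency
        scores[i] = score if score >= min_score else 0.0  # discard trivial clues
    total = sum(scores.values())
    if total == 0:
        return {i: 0 for i in scores}
    return {i: int(total_budget * s / total) for i, s in scores.items()}


if __name__ == "__main__":
    g = MemoryGraph()
    root = g.add_node("search: product manual", relevance=0.9)
    g.add_node("inspect page 3", parents=[root], relevance=0.8)
    g.add_node("search: unrelated video", parents=[root], relevance=0.01)
    print(allocate_tokens(g, total_budget=1000))
```

In this sketch a node that many later steps depend on keeps more tokens, while a low-relevance leaf drops to zero. How VimRAG actually combines these signals is described in the paper body, not in this excerpt.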

Paper Structure

This paper contains 55 sections, 12 equations, 16 figures, 6 tables, 1 algorithm.

Figures (16)

  • Figure 1: Inference pipeline of the VimRAG framework. (a) The cyclic inference loop consisting of reasoning, retrieval, and memory evolution. (b) Evolution of the structured reasoning topology: each node stores agent-specific memory, including the action, dynamically compressed multimodal observations, and the corresponding temporal and topological structure. (c) Step-by-step process of Graph-Modulated Visual Memory Encoding, which mimics human forgetting by integrating temporal, topological, and semantic relevance to adjust vision token density, filtering out noise while preserving valuable clues.
  • Figure 2: Quantitative analysis of memory structures. (a) Distribution of total token consumption for complete samples. (b) Count of invalid retrieval actions. By modeling the agent's current state rather than just storing facts, the graph-based paradigm avoids repetitive retrieval more effectively than the summary-based method.
  • Figure 3: Empirical analysis of misalignment between outcome rewards and step validity. (a) Distribution of step categories across binary outcome rewards. (b) Impact of removing redundancy or evidence steps, demonstrating the coarseness of rewards.
  • Figure 4: Overview of Graph-Guided Policy Optimization. (a) The Agentic Memory Training Framework segments rollout trajectories into atomic reasoning cycles within the memory paradigm, where outcome-based advantages are broadcast to enable step-level credit assignment. (b) Credit Assignment via Graph Pruning leverages the structured graph for precise credit assignment, applying gradient masks to avoid reinforcing inefficient dead-ends in positive samples and to prevent penalizing valuable retrievals in negative samples.
  • Figure 5: Ablation on GGPO. Our method is more robust than baseline GSPO without pruning.
  • ...and 11 more figures
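
The gradient-masking idea in the Figure 4 caption can be sketched as follows. This is an illustrative reading of the caption, not the paper's algorithm. The function names `ancestors` and `step_advantages`, the rule that a dead-end is any step that is not an ancestor of the answer node, and the per-step `has_evidence` flag are assumptions. The excerpt does not specify how valid and redundant steps are identified.

```python
def ancestors(parents, target):
    """Return the target node and every node it transitively depends on."""
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, []))
    return seen


def step_advantages(parents, answer_node, advantage, has_evidence):
    """Broadcast a trajectory-level advantage to steps, masking some of them.

    parents:      dict mapping each step to the list of steps it builds on
    answer_node:  the step that produced the final answer
    advantage:    trajectory-level advantage from the outcome reward
    has_evidence: dict flagging steps that retrieved useful evidence
                  (hypothetical signal, assumed to be available)
    """
    on_path = ancestors(parents, answer_node)
    masked = {}
    for node in parents:
        if advantage > 0:
            # Positive sample: do not reinforce dead-ends off the answer path.
            keep = node in on_path
        elif advantage < 0:
            # Negative sample: do not penalize steps that found evidence.
            keep = not has_evidence.get(node, False)
        else:
            keep = True
        masked[node] = advantage if keep else 0.0
    return masked


if __name__ == "__main__":
    # Step 2 branches off step 0 and is never used by the answer (step 3).
    graph = {0: [], 1: [0], 2: [0], 3: [1]}
    print(step_advantages(graph, answer_node=3, advantage=1.0, has_evidence={}))
    print(step_advantages(graph, answer_node=3, advantage=-1.0, has_evidence={1: True}))
```

A masked step receives zero advantage and therefore contributes no policy gradient, which is one simple way to realize a gradient mask.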