Fix Before Search: Benchmarking Agentic Query Visual Pre-processing in Multimodal Retrieval-augmented Generation

Jiankun Zhang; Shenglai Zeng; Kai Guo; Xinnan Dai; Hui Liu; Jiliang Tang; Yi Chang

TL;DR

V-QPP-Bench formulates Visual Query Pre-processing (V-QPP) as an agentic decision-making task in which MLLMs must autonomously diagnose imperfections in visual queries and deploy perceptual tools to refine them, and the results demonstrate the benchmark's value for developing robust MRAG systems.

Abstract

Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a key paradigm for grounding MLLMs with external knowledge. While query pre-processing (e.g., rewriting) is standard in text-based RAG, existing MRAG pipelines predominantly treat visual inputs as static and immutable, implicitly assuming they are noise-free. However, real-world visual queries are often "imperfect": they suffer from geometric distortions, quality degradation, or semantic ambiguity, which leads to catastrophic retrieval failures. To address this gap, we propose V-QPP-Bench, the first comprehensive benchmark dedicated to Visual Query Pre-processing (V-QPP). We formulate V-QPP as an agentic decision-making task where MLLMs must autonomously diagnose imperfections and deploy perceptual tools to refine queries. Our extensive evaluation across 46,700 imperfect queries and diverse MRAG paradigms reveals three critical insights: (1) Vulnerability: visual imperfections severely degrade both retrieval recall and end-to-end MRAG performance; (2) Restoration Potential and Bottleneck: while oracle preprocessing recovers near-perfect performance, off-the-shelf MLLMs struggle with tool selection and parameter prediction without specialized training; and (3) Training Enhancement: supervised fine-tuning enables compact models to achieve comparable or superior performance to larger proprietary models, demonstrating the benchmark's value for developing robust MRAG systems. The code is available at https://github.com/phycholosogy/VQQP_Bench
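To make the task formulation concrete, the following is a minimal sketch of the "select a tool, predict its parameters, apply it, then retrieve" loop described in the abstract. It is an illustration under stated assumptions, not the benchmark's implementation: the toy image type, the `rotate` and `crop` tools, the `ToolCall` structure, and `recall_at_k` are all hypothetical names introduced here, and the MLLM's decision is stood in for by a hand-written tool call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a toy grayscale image stored as rows of pixel values.
Image = List[List[int]]


def rotate90_cw(img: Image) -> Image:
    """Rotate a toy image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotate(img: Image, quarter_turns: int) -> Image:
    """Hypothetical tool: rotate clockwise by a number of quarter turns."""
    for _ in range(quarter_turns % 4):
        img = rotate90_cw(img)
    return img


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Hypothetical tool: keep a rectangular region of the image."""
    return [row[left:left + width] for row in img[top:top + height]]


# Hypothetical tool library; the benchmark's actual tools may differ.
TOOLS: Dict[str, Callable[..., Image]] = {"rotate": rotate, "crop": crop}


@dataclass
class ToolCall:
    """The agent's decision: which tool to use and with which parameters."""
    name: str
    params: dict


def apply_tool_call(img: Image, call: ToolCall) -> Image:
    """Apply the selected tool; "none" leaves the query unchanged."""
    if call.name == "none":
        return img
    return TOOLS[call.name](img, **call.params)


def recall_at_k(ranked_ids: List[str], relevant_ids: List[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


# A clean query, a geometrically distorted copy, and a restoring tool call.
original: Image = [[1, 2], [3, 4]]
distorted = rotate(original, 1)
restored = apply_tool_call(distorted, ToolCall("rotate", {"quarter_turns": 3}))
```

In this sketch both tool selection (the `name` field) and parameter prediction (the `params` field) must be right for the query to be restored, which mirrors the two abilities the abstract identifies as bottlenecks for off-the-shelf MLLMs.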

Paper Structure (44 sections, 4 equations, 18 figures, 6 tables, 1 algorithm)

Introduction
Related Works
RAG and Query Processing
Robustness Benchmarks for MLLMs
Problem Formulation
Preliminaries: Unified MRAG Retrieval Paradigms
Modeling Visual Imperfections
The V-QPP Task
V-QPP-Bench
Data Construction Pipeline
Data Source.
Imperfect Query Generation.
Retrieval Corpus Construction.
The Agentic Environment (Tool Library)
Evaluation Metrics
...and 29 more sections

Figures (18)

Figure 1: Comparison between Standard MRAG and V-QPP pipelines. Top: When facing real-world imperfections (such as a rotated document, a blurry street sign, or a target product hidden in a cluttered shelf), standard encoders fail to capture the user's intent, resulting in irrelevant retrieval (Red Cross). Bottom: The V-QPP pipeline introduces an agentic loop where an MLLM utilizes tools (e.g., cropping the cereal box, deblurring the text) to "clean" the visual query. This active pre-processing step recovers the semantic gap, enabling precise retrieval of the correct entities.
Figure 2: Retrieval performance (Recall@5) degradation in MRAG systems under various imperfect query conditions.
Figure 3: Holistic End-to-End Performance Assessment on V-QPP-Bench. Visualization of the final MRAG accuracy across various imperfections. Original Query (Grey Dashed): Standard MRAG performance on the original queries. Imperfect Query (Blue Solid): Standard MRAG performance on the imperfect queries. V-QPP Agent (Red Solid): Performance after active agentic refinement. Oracle (Green Dotted): Theoretical upper bound using optimal tool transformations.
Figure 4: Tool selection accuracy and parameter scores across models. The red dot denotes Qwen3-VL-4B-Instruct after SFT, while solid lines represent off-the-shelf models.
Figure 5: Performance and retrieval recall comparison. Purple line: Qwen3-VL-4B-Instruct after SFT; Red line: off-the-shelf Qwen3-VL-4B-Instruct.
...and 13 more figures
