MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training

Zhanpeng Chen, Chengjin Xu, Yiyan Qi, Jian Guo

TL;DR

RagVL addresses two challenges in multimodal retrieval-augmented generation, outdated model knowledge and multi-granularity noise, with a three-stage pipeline: a CLIP-based retriever, a knowledge-enhanced reranker trained via caption-aware instruction tuning, and a noise-injected generator that learns to handle visual uncertainty. The approach combines a ranking data construction process, adaptive thresholding, and diffusion-based noise contrast to improve both retrieval accuracy and generation robustness. Experiments on WebQA and MultimodalQA show substantial improvements in retrieval recall and near-oracle generation performance, with strong generalization in low-resource settings and across caption-to-image benchmarks. The work advances practical MLLM-based multimodal QA by enabling more up-to-date, reliable, and contextually grounded responses, and it also discusses efficiency considerations and potential deployment optimizations.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in processing and generating content across multiple data modalities. However, a significant drawback of MLLMs is their reliance on static training data, leading to outdated information and limited contextual awareness. This static nature hampers their ability to provide accurate and up-to-date responses, particularly in dynamic or rapidly evolving contexts. Though integrating Multimodal Retrieval-augmented Generation (Multimodal RAG) offers a promising solution, such a system inevitably encounters the multi-granularity noisy correspondence (MNC) problem, which hinders accurate retrieval and generation. In this work, we propose RagVL, a novel framework with knowledge-enhanced reranking and noise-injected training, to address these limitations. We instruction-tune the MLLM with a simple yet effective instruction template to induce its ranking ability and use it as a reranker to precisely filter the top-k retrieved images. For generation, we inject visual noise during training at the data and token levels to enhance the generator's robustness. Extensive experiments on subsets of two datasets that require retrieving and reasoning over images to answer a given query verify the effectiveness of our method. Code and models are available at https://github.com/IDEA-FinAI/RagVL.
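
To make the reranking-as-filtering idea in the abstract concrete, here is a minimal sketch. It assumes, purely for illustration, that the reranker is asked a yes/no relevance question per image and that the logits of the two answers are available; the excerpt above does not confirm this. The names `relevance_probability` and `filter_candidates` and the threshold value are hypothetical and are not taken from the RagVL code.

```python
import math

def relevance_probability(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over hypothetical 'Yes'/'No' answer logits."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

def filter_candidates(candidates, threshold=0.5):
    """Keep candidates whose relevance probability reaches the threshold.

    candidates: list of (image_id, yes_logit, no_logit) tuples.
    Returns (image_id, probability) pairs, most relevant first.
    The default threshold of 0.5 is an illustrative assumption.
    """
    scored = [(img, relevance_probability(y, n)) for img, y, n in candidates]
    kept = [(img, p) for img, p in scored if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Filtering on a probability rather than keeping a fixed number of images lets the reranker return fewer candidates, or none, when the retrieved images are mostly noise.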

Paper Structure (37 sections, 7 equations, 14 figures, 13 tables)

Figures (14)

  • Figure 1: Difference between traditional VQA and multimodal knowledge-seeking question answering. An example from WebQA (Chang et al., 2022) reveals the challenge of multi-granularity noisy correspondence (MNC).
  • Figure 2: Overview of our proposed RagVL. In the retrieval stage, we use the CLIP model and Faiss to find the top-$K$ most relevant images through Maximum Inner Product Search (MIPS) (Guo et al., 2020). Subsequently, the highly similar top-$K$ images are reranked into the top-$N$ with the fine-tuned MLLM reranker. Finally, the top-$N$ images are fed into the MLLM generator along with the query for accurate generation.
  • Figure 3: Generalizability of caption-aware instruction tuning. (a) compares the performance of the reranker fine-tuned on WebQA with the one fine-tuned on MultimodalQA, evaluated on MultimodalQA. (b) visualizes the changes in the probability distribution of correctly recalled items and the recall rate of the reranker under low-resource settings as the scale of the training dataset varies.
  • Figure 4: Density distribution of the relevance probability of correct and incorrect recalls on WebQA after reranking with the InternVL2-2B reranker.
  • Figure 5: "How many primary colors are found on the head of the Violet Turaco?"
  • ...and 9 more figures
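
The retrieve-then-rerank flow described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a brute-force inner-product search stands in for the CLIP embeddings and Faiss index, and `score_fn` is a hypothetical placeholder for the fine-tuned MLLM reranker's relevance score.

```python
from typing import Callable, List, Sequence, Tuple

def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mips_top_k(query: Sequence[float],
               images: Sequence[Sequence[float]],
               k: int) -> List[Tuple[int, float]]:
    """Exhaustive Maximum Inner Product Search.

    Returns the k (index, score) pairs with the largest inner product
    against the query, highest score first.
    """
    scored = [(i, inner_product(query, vec)) for i, vec in enumerate(images)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def rerank_top_n(candidates: Sequence[int],
                 score_fn: Callable[[int], float],
                 n: int) -> List[int]:
    """Reorder the retrieved candidates by score_fn and keep the best n."""
    return sorted(candidates, key=score_fn, reverse=True)[:n]

# Toy data: four 2-dimensional "image embeddings" and one query embedding.
images = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
top_k = mips_top_k([1.0, 0.0], images, k=2)

# Hypothetical reranker scores; a real system would query the MLLM here.
reranker_scores = {0: 0.2, 2: 0.9}
top_n = rerank_top_n([i for i, _ in top_k], reranker_scores.get, n=1)
```

The two stages are kept separate because the first must be cheap enough to scan the whole image collection, while the second can afford a costlier model since it only sees the $K$ retrieved candidates.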