MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou, Mohsen Fayyaz, Pan Lu, Kai-Wei Chang, Nanyun Peng

TL;DR

MRAG-Bench introduces a vision-centric retrieval-augmented evaluation framework for LVLMs, targeting scenarios where retrieved visual knowledge is more useful than retrieved text. It presents 9 scenarios across perspective and transformative aspects, with a ground-truth image corpus and 16,130 images to rigorously test visually augmented reasoning. Experimental results show consistent improvements from visual retrieval, yet reveal a notable gap between open-source and proprietary models in utilizing retrieved visuals, and show that humans still make better use of ground-truth knowledge than the models do. The work highlights the need for LVLMs to better discriminate high-quality visual evidence and adaptively use visual knowledge, offering a foundation for future vision-centric RAG research and benchmarks.

Abstract

Existing multimodal retrieval benchmarks primarily focus on evaluating whether models can retrieve and utilize external textual knowledge for question answering. However, there are scenarios where retrieving visual information is either more beneficial or easier to access than textual data. In this paper, we introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench, in which we systematically identify and categorize scenarios where visually augmented knowledge is better than textual knowledge, for instance, more images from varying viewpoints. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. With MRAG-Bench, we conduct an evaluation of 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs exhibit greater improvements when augmented with images compared to textual knowledge, confirming that MRAG-Bench is vision-centric. Additionally, we conduct extensive analysis with MRAG-Bench, which offers valuable insights into retrieval-augmented LVLMs. Notably, the top-performing model, GPT-4o, faces challenges in effectively leveraging retrieved knowledge, achieving only a 5.82% improvement with ground-truth information, in contrast to a 33.16% improvement observed in human participants. These findings highlight the importance of MRAG-Bench in encouraging the community to enhance LVLMs' ability to utilize retrieved visual knowledge more effectively.

Paper Structure

This paper contains 35 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Example scenarios from MRAG-Bench. Previous benchmarks such as WebQA (Chang et al., 2022), Encyclopedic-VQA, and InfoSeek (Chen et al., 2023) mainly focused on retrieving from textual knowledge. However, there are scenarios where retrieving correct textual knowledge is hard and sometimes not as useful as visual knowledge.
  • Figure 2: Key statistics of MRAG-Bench.
  • Figure 3: Qualitative examples on MRAG-Bench. For each scenario, we show the result of GPT-4o, Gemini Pro (Gemini Team, 2023), LLaVA-Next-Interleave (Li et al., 2024), and Mantis-8B-Siglip (Jiang et al., 2024). The ground-truth answer is in blue.
  • Figure 4: Qualitative example in which a proprietary model (Gemini Pro) identifies and uses the correct examples, while an open-source model (LLaVA-Next-Interleave) is misled by noisy retrieved information and answers incorrectly.
  • Figure 5: Left: LLaVA-Next-Interleave results with 4 different multimodal retrievers. Its performance using retrieved images has a 95% correlation with the retrievers' Recall@5 scores. Right: Average results over three random-seed runs. Increasing the number of ground-truth RAG examples steadily improves the model's performance, which peaks at 10 examples.
  • ...and 2 more figures
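
The Recall@5 score mentioned in the Figure 5 caption can be illustrated with a minimal sketch. It assumes one common definition (per-question fraction of ground-truth images found in the top-k retrieved results, averaged over questions); the paper's exact definition may differ, and the function name and image IDs below are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(
    retrieved: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int = 5,
) -> float:
    """Mean per-question Recall@k (assumed definition, not taken from the paper).

    retrieved[i] is the ranked list of image IDs returned for question i.
    relevant[i] is the set of ground-truth image IDs for question i.
    """
    if len(retrieved) != len(relevant):
        raise ValueError("retrieved and relevant must have the same length")

    scores = []
    for ranked, ground_truth in zip(retrieved, relevant):
        if not ground_truth:
            # Skip questions without ground-truth images to avoid dividing by zero.
            continue
        hits = len(set(ranked[:k]) & ground_truth)
        scores.append(hits / len(ground_truth))

    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical image IDs for two questions.
example_retrieved = [["a", "b", "c"], ["x", "y"]]
example_relevant = [{"a", "z"}, {"y"}]
example_score = recall_at_k(example_retrieved, example_relevant, k=2)
```

In this toy input the first question recovers one of its two ground-truth images in the top 2 and the second recovers its only one, so the mean is 0.75. A retriever comparison like the one in Figure 5 would compute this score once per retriever and relate it to downstream accuracy.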