VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation

Hyeonseok Lim, Dongjae Shin, Seohyun Song, Inho Won, Minjun Kim, Junghun Yuk, Haneol Jang, KyungTae Lim

TL;DR

VLR-Bench is a multilingual benchmark for evaluating retrieval-augmented generation in vision-language models (VLMs). Each sample pairs an image-guided query with five passages, so the benchmark tests whether a model can identify which passages are useful for answering. The accompanying VLR-IF dataset provides instruction-following data in English, Chinese, and Korean that trains models to select and use relevant external knowledge. Experiments with Llava-Llama-3 and GPT-4o show that passage selection and external-knowledge use substantially affect performance, and that training on VLR-IF yields notable improvements across metrics and on benchmarks such as InfoSeek. The benchmark and training data are publicly available, and the evaluation framework grades passage quality as gold, silver, or bronze to assess retrieval-augmented reasoning in knowledge-intensive multimodal tasks.

Abstract

We propose VLR-Bench, a visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) based on retrieval-augmented generation (RAG). Unlike existing evaluation datasets for external knowledge-based VQA, VLR-Bench includes five input passages per sample. This makes it possible to test whether a model can determine which passage is useful for answering a given query, an ability that previous research did not evaluate. We also constructed VLR-IF, a dataset of 32,000 automatically generated instruction-following examples. VLR-IF is designed to enhance the RAG capabilities of VLMs by teaching them to generate appropriate answers based on input passages. We evaluated the validity of the proposed benchmark and training data and verified their effectiveness using the state-of-the-art Llama3-based VLM, the Llava-Llama-3 model. The proposed VLR-Bench and VLR-IF datasets are publicly available online.
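To make the five-passage setup concrete, the following is a minimal Python sketch of how one benchmark sample might be represented and turned into a text prompt. It is illustrative only: the class names, field names, prompt layout, and helper functions are hypothetical and are not taken from the released datasets. The summary names gold, silver, and bronze passage tiers but does not define them here, so the sketch simply assumes that "gold" marks the passages a model should rely on.

```python
from dataclasses import dataclass
from typing import List

# Tier names come from the summary; their exact semantics are an assumption here.
PASSAGE_LABELS = ("gold", "silver", "bronze")


@dataclass
class Passage:
    text: str
    label: str  # one of PASSAGE_LABELS


@dataclass
class Sample:
    """Hypothetical container for one sample: image, query, five passages, answer."""
    image_path: str
    query: str
    passages: List[Passage]
    answer: str

    def __post_init__(self) -> None:
        # The benchmark is described as providing five input passages per sample.
        if len(self.passages) != 5:
            raise ValueError("expected exactly five passages per sample")
        for passage in self.passages:
            if passage.label not in PASSAGE_LABELS:
                raise ValueError(f"unknown passage label: {passage.label!r}")


def build_prompt(sample: Sample) -> str:
    """Concatenate the numbered passages and the query (assumed layout)."""
    lines = [f"Passage {i}: {p.text}" for i, p in enumerate(sample.passages, start=1)]
    lines.append(f"Question: {sample.query}")
    return "\n".join(lines)


def useful_passage_indices(sample: Sample) -> List[int]:
    """Return 1-based indices of passages labeled 'gold' (assumed to be the useful ones)."""
    return [i for i, p in enumerate(sample.passages, start=1) if p.label == "gold"]
```

In this sketch the labels are kept out of the prompt, so a model would see all five passages without knowing which ones are useful; the labels would only be used afterwards to check which passages the answer should have drawn on.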

Paper Structure

This paper contains 42 sections, 12 figures, 11 tables.

Figures (12)

  • Figure 1: An example of a VLR-Bench data sample.
  • Figure 2: Examples of the created VLR-Bench data (English culture).
  • Figure 3: Examples of the created VLR-Bench data (commonsense knowledge).
  • Figure 4: Examples of the created VLR-Bench data (Korean culture).
  • Figure 5: Examples of the created VLR-Bench data (commonsense knowledge).
  • ...and 7 more figures