On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets

Ning Liao, Shaofeng Zhang, Renqiu Xia, Min Cao, Yu Qiao, Junchi Yan

TL;DR

This work shifts VLIT evaluation from model-centric to dataset-centric analysis by introducing a tune-cross-evaluation paradigm and three metrics, Meta Quality ($MQ$), Dataset Quality ($DQ$), and Sample Quality ($SQ$), to assess VLIT datasets. It argues that high-quality, consistently annotated datasets are crucial for building an all-powerful VLIT model and provides a principled way to refine existing datasets into REVO-LION, which achieves performance comparable to training on the full data while using only half of it. The authors validate the paradigm across multiple VLIT architectures, showing that merging datasets generally improves comprehensive capability, while careful sample selection (high $SQ$) yields similar or better results with less data. REVO-LION is released with an evaluation set designed as a practical benchmark, enabling VLIT research and benchmarking without relying on external judgment or human ratings. The work offers a foundation for grounded VLIT benchmarking and highlights the importance of data quality and selection strategy for multimodal instruction-tuning systems.

Abstract

Multimodal instruction tuning is an emerging line of research, and a number of benchmarks have recently been proposed for evaluating the resulting models. Instead of evaluating the models directly, in this paper we evaluate the Vision-Language Instruction-Tuning (VLIT) datasets themselves. We also seek a way of building a dataset for developing an all-powerful VLIT model, which we believe could also be useful for establishing a grounded protocol for benchmarking VLIT models. Because effective evaluation of VLIT datasets remains an open question, we propose a tune-cross-evaluation paradigm: tuning on one dataset and evaluating on the others in turn. For each single tune-evaluation experiment set, we define the Meta Quality (MQ) as the mean score obtained by a set of caption metrics including BLEU, METEOR, and ROUGE-L, which quantifies the quality of a certain dataset or sample. On this basis, to evaluate the comprehensiveness of a dataset, we develop the Dataset Quality (DQ), which covers all tune-evaluation sets. To lay the foundation for building a comprehensive dataset and developing an all-powerful model for practical applications, we define the Sample Quality (SQ) to quantify the all-sided quality of each sample. Extensive experiments validate the rationality of the proposed evaluation paradigm. Based on the holistic evaluation, we build a new dataset, REVO-LION (REfining VisiOn-Language InstructiOn tuNing), by collecting samples with higher SQ from each dataset. Remarkably, even with only half of the complete data, the model trained on REVO-LION can achieve performance comparable to that obtained by simply adding all VLIT datasets together. Furthermore, REVO-LION not only facilitates the development of a powerful model but also incorporates an evaluation set, which is designed to serve as a convenient benchmark for future research in the field.
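
As a rough illustration of how the three scores relate, the sketch below computes MQ as the mean of precomputed caption-metric scores, then aggregates it into DQ and SQ and keeps the higher-SQ samples. The abstract does not give the exact DQ and SQ formulas, so the unweighted averages used here are assumptions, and the function names, the `keep_fraction` parameter, and all numbers are hypothetical.

```python
from statistics import mean

def meta_quality(metric_scores):
    """MQ for one tune->evaluate run: the mean of its caption-metric scores."""
    return mean(metric_scores.values())

def dataset_quality(mq_table, tuned_on):
    """Assumed DQ: unweighted mean of MQ over every evaluation set except the tuning set."""
    return mean(mq for (t, e), mq in mq_table.items() if t == tuned_on and e != tuned_on)

def sample_quality(sample_mq_by_tuner):
    """Assumed SQ: unweighted mean of one sample's MQ across models tuned on the other datasets."""
    return mean(sample_mq_by_tuner.values())

def refine(sq_by_sample, keep_fraction=0.5):
    """Keep the highest-SQ share of one dataset's samples (the cutoff is a placeholder)."""
    ranked = sorted(sq_by_sample, key=sq_by_sample.get, reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]

# Toy numbers for illustration only, not results from the paper.
mq_table = {
    ("A", "B"): meta_quality({"BLEU": 0.20, "METEOR": 0.30, "ROUGE-L": 0.40}),
    ("A", "C"): meta_quality({"BLEU": 0.10, "METEOR": 0.20, "ROUGE-L": 0.30}),
}
dq_a = dataset_quality(mq_table, "A")          # quality of dataset A as a tuning set
sq_s1 = sample_quality({"B": 0.6, "C": 0.8})   # quality of one sample of A
kept = refine({"s1": 0.9, "s2": 0.1, "s3": 0.5, "s4": 0.7})
```

Here the keys of `mq_table` are (tuning dataset, evaluation dataset) pairs, so adding a dataset only means adding rows for its cross-evaluation runs.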

Paper Structure (21 sections, 6 equations, 25 figures, 8 tables)

Figures (25)

  • Figure 1: The popular architecture in current vision-language instruction tuning methods (Dai et al., 2023; Liu et al., 2023). A frozen image encoder extracts the visual feature, an optimizable projection module transfers it into the language space, and a frozen Large Language Model (LLM) generates the text output.
  • Figure 2: The overall framework of the proposed tune-cross-evaluation paradigm. Left: The diagram of Dataset Quality (DQ) evaluation. Each dataset adopted for testing measures the quality of the tuning dataset $D_T$ on the aspect that the testing datasets are constructed towards. Right: The diagram of Sample Quality (SQ) evaluation. Each dataset used for tuning measures how well the samples in the testing set $D_E$ match with the ability that the tuning dataset is constructed towards.
  • Figure 3: The diagram of the data split process. It is designed to validate the effectiveness of the proposed tune-cross-evaluation paradigm and the data refinement strategy in the main experiments. Each original dataset is divided into two parts: $80\%$ of the samples are collected as a tuning set for data evaluation and refinement, and 600 samples from the remaining $20\%$ are collected into a balanced and comprehensive evaluation set. For robust validation, we perform this partition twice, creating SPLIT1 and SPLIT2, which are used in the main experiments.
  • Figure 4: Visualizations of $MQ^D_{T \rightarrow i}(i \neq T)$ in dataset quality evaluation. Lines with different colors represent different datasets $D_T$ used for instruction tuning.
  • Figure 5: Three samples in LLaVA-Reasoning. (a) and (b) are easy reasoning problems that resemble image-description problems. (c) is a hard reasoning problem that requires logical thinking. Q: question. A: answer.
  • ...and 20 more figures
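
The data split described in the Figure 3 caption can be sketched as follows. This is a minimal illustration that assumes a seeded random shuffle, which the caption does not specify; the function name `split_dataset` and the seeds standing in for SPLIT1 and SPLIT2 are hypothetical.

```python
import random

def split_dataset(samples, seed, tune_frac=0.8, eval_size=600):
    """Split one dataset into a tuning set and a fixed-size evaluation subset."""
    rng = random.Random(seed)          # seeded so each split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * tune_frac)
    tuning = shuffled[:cut]            # 80% used for data evaluation and refinement
    held_out = shuffled[cut:]          # remaining 20%
    evaluation = held_out[:eval_size]  # 600 samples go to the shared evaluation set
    return tuning, evaluation

# Two independent partitions, standing in for SPLIT1 and SPLIT2.
dataset = list(range(10_000))          # placeholder sample IDs
tune1, eval1 = split_dataset(dataset, seed=1)
tune2, eval2 = split_dataset(dataset, seed=2)
```

Drawing the same number of evaluation samples from every dataset is what keeps the combined evaluation set balanced across datasets of different sizes.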