MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning

Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Jiayi Ji, Jie Lou, Debing Zhang, Rongrong Ji

TL;DR

This work tackles data efficiency in visual instruction tuning (VIT) by introducing MLLM-Selector, a two-stage data selection framework for multimodal large language models driven by necessity and diversity. Stage 1 randomly samples a subset of the data pool and fine-tunes a pretrained model on it, yielding a seed model with an initial instruction-following ability. Stage 2 uses the seed model to compute a necessity score for each sample, and these scores guide a grouped sampling procedure that preserves diversity. Under identical experimental settings, MLLM-Selector outperforms LLaVA-1.5 on some benchmarks with less than 1% of the data and on all validated benchmarks with less than 50%; with equal data amounts it shows notable gains on several tasks (e.g., +14.54% on DOCVQA, +25.36% on ChartQA). The approach is robust across LLM sizes and emphasizes data pool composition, enabling significantly more efficient VIT training and improved downstream performance.

Abstract

Visual instruction tuning (VIT) has emerged as a crucial technique for enabling multi-modal large language models (MLLMs) to follow user instructions adeptly. Yet, a significant gap persists in understanding the attributes of high-quality instruction tuning data and frameworks for its automated selection. To address this, we introduce MLLM-Selector, an automated approach that identifies valuable data for VIT by weighing necessity and diversity. Our process starts by randomly sampling a subset from the VIT data pool to fine-tune a pretrained model, thus creating a seed model with an initial ability to follow instructions. Then, leveraging the seed model, we calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance. Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector, our methodology that fuses necessity scoring with strategic sampling for superior data refinement. Empirical results indicate that within identical experimental conditions, MLLM-Selector surpasses LLaVA-1.5 in some benchmarks with less than 1% of the data and consistently exceeds performance across all validated benchmarks when using less than 50%.
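
The two-stage procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `grouped_necessity_selection`, the `necessity_fn` callback (a stand-in for whatever score the seed model assigns), the number of groups, and the linearly decreasing group quotas are all assumptions made for illustration. The abstract does not specify the scoring rule or the grouping scheme, and the sketch also assumes the final selection is the seed subset plus the necessity-guided subset.

```python
import random
from typing import Callable, List, Optional, Sequence


def grouped_necessity_selection(
    pool: Sequence[object],
    necessity_fn: Callable[[object], float],  # hypothetical: score from a seed model
    budget: int,
    seed_size: int,
    num_groups: int = 4,  # hypothetical default, not taken from the paper
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return indices of selected samples: a random seed subset plus a
    necessity-guided subset drawn from every score group."""
    rng = rng or random.Random(0)
    if not 0 <= seed_size <= budget <= len(pool):
        raise ValueError("need 0 <= seed_size <= budget <= len(pool)")

    # Stage 1: random seed subset. In the described method this subset is used
    # to fine-tune a pretrained model into the seed model; training is omitted here.
    all_idx = list(range(len(pool)))
    seed_idx = rng.sample(all_idx, seed_size)
    seed_set = set(seed_idx)

    # Stage 2a: score the remaining samples and sort from high to low necessity.
    rest = [i for i in all_idx if i not in seed_set]
    scores = {i: necessity_fn(pool[i]) for i in rest}
    rest.sort(key=scores.__getitem__, reverse=True)

    remaining = budget - seed_size
    if remaining == 0 or not rest:
        return list(seed_idx)

    # Stage 2b: split the ranked list into contiguous groups of similar score.
    num_groups = max(1, min(num_groups, len(rest)))
    bounds = [g * len(rest) // num_groups for g in range(num_groups + 1)]
    groups = [rest[bounds[g]:bounds[g + 1]] for g in range(num_groups)]

    # Stage 2c (assumption): higher-necessity groups get larger quotas, but
    # lower groups still receive a share so the selection stays diverse.
    weights = [num_groups - g for g in range(num_groups)]
    total_w = sum(weights)
    quotas = [min(len(grp), remaining * w // total_w)
              for grp, w in zip(groups, weights)]

    # Hand any rounding leftover to the highest-necessity groups with spare capacity.
    leftover = remaining - sum(quotas)
    for g, grp in enumerate(groups):
        extra = min(leftover, len(grp) - quotas[g])
        quotas[g] += extra
        leftover -= extra

    # Sample uniformly at random inside each group.
    selected = list(seed_idx)
    for grp, quota in zip(groups, quotas):
        selected.extend(rng.sample(grp, quota))
    return selected


# Toy usage: each "sample" is just a number that doubles as its necessity score.
toy_pool = [float(i) for i in range(1000)]
chosen = grouped_necessity_selection(toy_pool, lambda s: s, budget=100, seed_size=20)
```

Sampling within score groups, rather than taking only the top-scoring samples, is one simple way to combine necessity with diversity: high-necessity groups dominate the budget, while lower groups still contribute some samples.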

Paper Structure

This paper contains 27 sections, 6 equations, 6 figures, 9 tables, 2 algorithms.

Figures (6)

  • Figure 1: The VIT dataset might include low-value or erroneous data, potentially impairing MLLM performance.
  • Figure 2: Overview of MLLM-Selector, which consists of two stages: Stage 1: Initial Seed Data Selection via Random Sampling and Stage 2: Necessity Data Selection through Necessity-Based Grouped Sampling.
  • Figure 3: Performance comparison of different visual instruction tuning data composition methods. The highest score for each benchmark is highlighted.
  • Figure 4: The relationship between data volume and performance across various benchmarks: VizWizVQA [gurari2018vizwiz], ScienceQA IMG [lu2022learn], TextVQA [singh2019towards], DOCVQA [mathew2021docvqa], ChartQA [masry2022chartqa], AI2D [kembhavi2016diagram], POPE [li2023evaluating], SeedBench 2 [li2023seed2], Ferret [you2023ferret], MMVet [yu2023mm], HallusionBenchmark [guan2024hallusionbench], GQA [hudson2019gqa], MME Cognition [fu2023mme], MME Perception [fu2023mme], OKVQA [marino2019ok], and SeedBench IMG [li2023seed]. The data shows how performance metrics change as the data volume increases from 0 to 665 thousand samples for each benchmark.
  • Figure 5: Examples of samples with low necessity scores, categorized into three types: (a) Samples with noticeable differences in options, leading to easily solvable problems. (b) Samples where the problem is loosely related to the image, resulting in straightforward answers. (c) Samples with overly simple questions about charts.
  • ...and 1 more figure