Concept-skill Transferability-based Data Selection for Large Vision-Language Models

Jaewoo Lee, Boyang Li, Sung Ju Hwang

TL;DR

COINCIDE tackles the high data and compute cost of visual instruction tuning (VIT) for Large Vision-Language Models (LVLMs). It clusters VIT data using activations from a small reference model to uncover concept-skill compositions, then decides how much data to select from each cluster using a transferability proxy and the cluster's density. Clusters that are more transferable and less dense are sampled more heavily, with $P_i \propto \exp\left( S_i / (\tau D_i) \right)$, where $S_i$ is the transferability of cluster $i$, $D_i$ is its density, and $\tau$ is a temperature. Within each cluster, samples are chosen to minimize $\text{MMD}^2$ between the selected subset and the full cluster, so the subset preserves the cluster's distribution. COINCIDE matches full-data performance on LLaVA-1.5 with only $20\%$ of the data and a $70\%$ reduction in wall-clock time, and achieves superior results on Vision-Flan with only $16.7\%$ of the data. The results point to diversified concept-skill coverage and transferability-aware sampling as drivers of LVLM generalization, and the method offers a scalable data selection approach that needs only a small reference model and applies to large multimodal datasets.
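As a rough illustration of the two sampling steps named in the TL;DR, the sketch below turns per-cluster transferability $S_i$ and density $D_i$ into probabilities $P_i \propto \exp(S_i / (\tau D_i))$ and evaluates a squared MMD between two sets of feature vectors. The function names, the default temperature, the largest-remainder rounding of the budget, and the Gaussian kernel with bandwidth `sigma` are all assumptions made for this example, not details taken from the paper.

```python
import math

def cluster_sampling_probs(transferability, density, tau=0.1):
    """P_i proportional to exp(S_i / (tau * D_i)), computed as a stable softmax.

    tau=0.1 is an illustrative default, not a value from the paper.
    """
    logits = [s / (tau * d) for s, d in zip(transferability, density)]
    top = max(logits)  # subtract the max so exp() cannot overflow
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def allocate_budget(probs, budget):
    """Hypothetical helper: split `budget` samples across clusters by probability.

    Uses largest-remainder rounding and ignores cluster size limits.
    """
    counts = [int(p * budget) for p in probs]
    leftover = max(0, budget - sum(counts))
    order = sorted(range(len(probs)),
                   key=lambda i: probs[i] * budget - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel; the kernel choice and sigma are assumptions."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two sets of vectors."""
    def mean_kernel(a, b):
        return sum(gaussian_kernel(p, q, sigma) for p in a for q in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

if __name__ == "__main__":
    probs = cluster_sampling_probs([0.8, 0.8, 0.2], [0.5, 1.0, 1.0], tau=1.0)
    print(probs, allocate_budget(probs, 100))
```

With equal transferability, the lower-density cluster receives the larger share; a subset whose `mmd2` against its full cluster is small is one that preserves the cluster's distribution under the chosen kernel.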

Abstract

Instruction tuning, or supervised finetuning on extensive task-specific data, is necessary for Large Vision-Language Models (LVLMs) to generalize well across a broad range of vision-language (VL) tasks. However, training on large VL datasets can become prohibitively expensive. In this work, we introduce COINCIDE, an effective and scalable data selection technique that uses a small model as a reference model to select visual instruction tuning data for efficient finetuning of a target LVLM, focusing on diversity and transferability. Specifically, we cluster the training data using internal activations from a small model, which identifies VL concept-skill compositions needed by a target LVLM. We then sample data from these diverse clusters by considering their density and transferability, or the ability to transfer well to other concept-skill compositions. This approach ensures the diversity of these compositions, which is vital for LVLM generalization. Extensive experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines on two distinct datasets: LLaVA-1.5 and Vision-Flan. Using only 20% of the LLaVA-1.5 dataset, COINCIDE achieves performance comparable to the LVLM finetuned on the whole dataset, with 70% reduction of the wall-clock running time. On the Vision-Flan dataset, our method achieves superior results with only 16.7% of the training data.
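The clustering step in the abstract can be pictured with a small sketch. It assumes each training example has already been mapped to a feature vector (for instance, activations from the small reference model) and groups the vectors with a cosine-similarity k-means. The choice of k-means, the naive initialization, and every name below are assumptions for illustration; the abstract does not specify the clustering algorithm.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(u, v):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(a * b for a, b in zip(u, v))

def kmeans_cosine(features, k, iters=20):
    """Cluster feature vectors by cosine similarity.

    Assumption: centroids are initialized from the first k points, which
    keeps the sketch deterministic but is not a recommended initialization.
    """
    feats = [normalize(f) for f in features]
    centroids = [feats[i][:] for i in range(k)]
    labels = [0] * len(feats)
    for _ in range(iters):
        # Assignment step: each point goes to its most similar centroid.
        labels = [max(range(k), key=lambda c: dot(f, centroids[c])) for f in feats]
        # Update step: each centroid becomes the normalized mean of its members.
        for c in range(k):
            members = [f for f, label in zip(feats, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels, centroids

if __name__ == "__main__":
    toy_features = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
    labels, centroids = kmeans_cosine(toy_features, k=2)
    print(labels)
```

In this picture, each resulting cluster stands in for one concept-skill composition, and the centroids are what a later step would compare to estimate transferability.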

Paper Structure (41 sections, 8 equations, 13 figures, 6 tables, 1 algorithm)

Figures (13)

  • Figure 1: Different VL tasks in LLaVA-1.5 (Liu et al., 2023) exhibit different score distributions. Thus, selecting data based on a single score metric like EL2N (Paul et al., 2021) or Self-Filter (Chen et al., 2024) results in a biased coreset (red), substantially decreasing the diversity within the coreset.
  • Figure 2: Different VL tasks (e.g., VQAv2 and GQA, LLaVA-Conv and LLaVA-Reason) share VL concept-skill compositions.
  • Figure 3: Illustration of COINCIDE. Our method uses a small LVLM to cluster visual instruction tuning data based on concept-skill compositions. We then assess cluster transferability as the mean cosine similarity to the other cluster centroids, and compute cluster density as the mean Gaussian kernel distance among all data pairs within the cluster. Using cluster transferability and density, COINCIDE determines how many samples to draw from each cluster and performs intra-cluster sampling. Finally, it combines the samples selected from every cluster into the final coreset.
  • Figure 4: Correlation between cluster centroid similarity and transferability. We examine the correlations in the LLaVA-1.5 (Liu et al., 2023) and Vision-Flan (Xu et al., 2024) datasets, with each point representing a source cluster. We report the Pearson correlation coefficient ($r$) and p-value.
  • Figure 5: Average relative performance of all coreset selection techniques at different sampling ratios on the LLaVA-1.5 dataset.
  • ...and 8 more figures
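
The two cluster statistics described in the Figure 3 caption can be sketched as follows: transferability as the mean cosine similarity between a cluster's centroid and every other centroid, and density as the mean Gaussian kernel value over all pairs of points in the cluster. The function names, the kernel bandwidth `sigma`, and the value returned for a single-point cluster are assumptions made for this example.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cluster_transferability(centroids):
    """S_i: mean cosine similarity of centroid i to all other centroids.

    Requires at least two clusters.
    """
    k = len(centroids)
    return [
        sum(cosine_similarity(centroids[i], centroids[j])
            for j in range(k) if j != i) / (k - 1)
        for i in range(k)
    ]

def cluster_density(points, sigma=1.0):
    """D_i: mean Gaussian kernel value over all pairs of points in a cluster.

    Assumptions: bandwidth sigma=1.0, and a single-point cluster is given
    density 1.0 so that later divisions by D_i stay well defined.
    """
    pairs = list(combinations(points, 2))
    if not pairs:
        return 1.0
    total = 0.0
    for x, y in pairs:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
        total += math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total / len(pairs)

if __name__ == "__main__":
    centroids = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(cluster_transferability(centroids))
    print(cluster_density([[0.0, 0.0], [0.1, 0.0]]),
          cluster_density([[0.0, 0.0], [3.0, 0.0]]))
```

Under this kernel, tightly packed clusters score close to 1 and spread-out clusters score close to 0, so a high density value marks a cluster whose members are largely redundant.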