Concept-skill Transferability-based Data Selection for Large Vision-Language Models
Jaewoo Lee, Boyang Li, Sung Ju Hwang
TL;DR
COINCIDE reduces the data and compute cost of visual instruction tuning (VIT) for Large Vision-Language Models (LVLMs). It clusters VIT data using activations from a small reference model to uncover concept-skill compositions, then selects data from each cluster according to a transferability proxy $S_i$ and a density $D_i$. Clusters that are more transferable and less dense are sampled more heavily, with $P_i \propto \exp\left( S_i / (\tau D_i) \right)$, where $\tau$ is a temperature. Within each cluster, samples are chosen to minimize $\text{MMD}^2$ between the selected subset and the full cluster, which preserves the cluster's distribution. On LLaVA-1.5 and Vision-Flan, COINCIDE matches or exceeds full-data performance using only $16.7\%$ to $20\%$ of the data, and on LLaVA-1.5 it cuts wall-clock running time by 70%. The authors attribute this generalization to diverse concept-skill coverage and transferability-aware sampling. The results underline the importance of data quality and diversity for LVLM generalization, and because selection relies only on a small reference model, the approach scales to large multimodal datasets.
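The two quantities named in the summary can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function names, the default temperature, the Gaussian kernel and its bandwidth `gamma`, and the choice of the biased MMD estimator are all assumptions made for illustration.

```python
import numpy as np

def cluster_sampling_probs(S, D, tau=0.1):
    """P_i proportional to exp(S_i / (tau * D_i)), as a numerically stable softmax.

    S: per-cluster transferability proxy, D: per-cluster density (must be > 0).
    The default tau is an arbitrary illustrative value.
    """
    S = np.asarray(S, dtype=float)
    D = np.asarray(D, dtype=float)
    logits = S / (tau * D)
    logits = logits - logits.max()  # subtract the max so exp() cannot overflow
    w = np.exp(logits)
    return w / w.sum()

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples X and Y.

    Uses a Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2); the kernel
    and bandwidth are assumptions of this sketch.
    """
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

With equal transferability, the less dense cluster receives the larger probability; with equal density, the more transferable one does. A candidate subset of a cluster can be scored by calling `mmd2_rbf(subset, cluster)`, where a smaller value means the subset better matches the cluster's distribution.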
Abstract
Instruction tuning, or supervised finetuning on extensive task-specific data, is necessary for Large Vision-Language Models (LVLMs) to generalize well across a broad range of vision-language (VL) tasks. However, training on large VL datasets can become prohibitively expensive. In this work, we introduce COINCIDE, an effective and scalable data selection technique that uses a small model as a reference model to select visual instruction tuning data for efficient finetuning of a target LVLM, focusing on diversity and transferability. Specifically, we cluster the training data using internal activations from a small model, which identifies the VL concept-skill compositions needed by a target LVLM. We then sample data from these diverse clusters by considering their density and their transferability, that is, how well a cluster transfers to other concept-skill compositions. This approach ensures the diversity of these compositions, which is vital for LVLM generalization. Extensive experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines on two distinct datasets: LLaVA-1.5 and Vision-Flan. Using only 20% of the LLaVA-1.5 dataset, COINCIDE achieves performance comparable to the LVLM finetuned on the whole dataset, with a 70% reduction in wall-clock running time. On the Vision-Flan dataset, our method achieves superior results with only 16.7% of the training data.
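The cluster-then-sample pipeline described in the abstract can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the paper's code: `X` stands in for activations already extracted from the small reference model, the k-means routine and its farthest-point initialization are generic, the transferability proxy (mean cosine similarity between centroids) and density (mean Gaussian-kernel similarity within a cluster) are hypothetical stand-ins for the paper's definitions, and samples inside a cluster are drawn uniformly at random for brevity.

```python
import numpy as np

def kmeans(X, k, n_iter=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    centroids = np.stack(centroids).astype(float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(0)
    return labels, centroids

def cluster_stats(X, labels, centroids, gamma=1.0):
    """Return (S, D) per cluster; both definitions are assumptions of this sketch.

    Requires k >= 2, non-empty clusters, and non-zero centroids.
    """
    k = len(centroids)
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    cos = unit @ unit.T
    S = (cos.sum(1) - 1.0) / (k - 1)  # mean cosine similarity to the other centroids
    D = np.empty(k)
    for j in range(k):
        P = X[labels == j]
        sq = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
        D[j] = np.exp(-gamma * sq).mean()  # mean kernel similarity, self-pairs included
    return S, D

def select_subset(X, k, fraction, tau=1.0, seed=0):
    """Pick roughly fraction * len(X) indices, weighting clusters by exp(S / (tau * D))."""
    rng = np.random.default_rng(seed)
    labels, centroids = kmeans(X, k)
    S, D = cluster_stats(X, labels, centroids)
    logits = S / (tau * D)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    budget = int(round(fraction * len(X)))
    chosen = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        n_j = min(len(idx), int(round(p[j] * budget)))  # cap at the cluster size
        chosen.extend(rng.choice(idx, size=n_j, replace=False).tolist())
    return np.array(sorted(chosen), dtype=int), labels
```

Because per-cluster counts are rounded and capped at the cluster size, the number of selected indices can differ slightly from the requested budget. Calling `select_subset(X, k=2, fraction=0.2)` on a feature matrix `X` returns the chosen row indices together with the cluster labels.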
