ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning

Changti Wu, Jiahuai Mao, Yuzhuo Miao, Shijie Lian, Bin Yu, Xiaopeng Lin, Cong Huang, Lei Zhang, Kai Chen

TL;DR

ScalSelect tackles the data inefficiency of large-scale Visual Instruction Tuning by proposing a training-free, linear-time data selection method. It constructs instruction-conditioned representations from the first transformer layer of the target VLM and selects samples via a global subspace preservation strategy using leverage scores. The approach avoids external proxies and pairwise comparisons, yet preserves the dominant structure of the full dataset, achieving over 97.5% of full-data performance with only 16% of the data and sometimes surpassing full-data training. Extensive experiments across models, datasets, and budgets demonstrate ScalSelect's robustness and practicality for scalable multimodal learning.

Abstract

Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on large-scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates multimodal data selection to improve training efficiency. Existing data selection methods for VIT require either costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at https://github.com/ChangtiWu/ScalSelect.
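
To make the subspace-based scoring step concrete, the following is a minimal sketch of leverage-score selection over a stacked representation matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names (leverage_scores, select_top), the rank parameter, the use of a plain thin SVD, and the top-k selection rule are all hypothetical choices made here, and details such as centering or normalization of the representations are not specified in the abstract.

```python
import numpy as np


def leverage_scores(reps: np.ndarray, rank: int) -> np.ndarray:
    """Rank-`rank` leverage score for each row of `reps` (shape N x d).

    Assumption: importance is the squared row norm of the top-`rank`
    left singular vectors, the standard leverage-score definition.
    """
    # Thin SVD costs O(N * d^2) when N >= d, i.e. linear in the number
    # of samples N, and needs no pairwise sample comparisons.
    u, _, _ = np.linalg.svd(reps, full_matrices=False)
    return np.sum(u[:, :rank] ** 2, axis=1)


def select_top(reps: np.ndarray, rank: int, budget: float) -> np.ndarray:
    """Indices of the highest-scoring samples for a budget in (0, 1]."""
    scores = leverage_scores(reps, rank)
    k = max(1, int(round(budget * len(scores))))
    # Sort by descending score and keep the first k indices.
    return np.argsort(-scores)[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reps = rng.normal(size=(200, 16))  # hypothetical N=200, d=16
    chosen = select_top(reps, rank=4, budget=0.16)
    print(len(chosen))
```

Because the columns of the left singular matrix are orthonormal, the scores lie in [0, 1] and sum to the chosen rank, which gives a quick sanity check on any implementation of this kind.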

Paper Structure (21 sections, 12 equations, 2 figures, 9 tables)

Figures (2)

  • Figure 1: Overview of ScalSelect. (Left) For each sample, the target VLM extracts instruction-conditioned early representations by aggregating the visual tokens that are most attended by the user instruction tokens in the first layer of the LLM, yielding a compact sample representation. (Right) The representations of all samples are stacked into a representation matrix, from which ScalSelect identifies the dominant low-rank subspace of the full representation space (the space spanned by the full dataset representations). Each sample is scored according to its contribution to this dominant subspace, producing an importance score distribution from which a compact subset of informative samples is selected.
  • Figure 2: Distribution of Importance Scores. Left: Histogram of importance scores, exhibiting a highly skewed, long-tailed distribution. Right: Ranking curve of importance scores sorted in descending order, exhibiting a sharp initial drop followed by a long, gradually decaying tail.
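
The representation step described in the Figure 1 caption can be sketched as follows. This is a hedged illustration only: the function name instruction_conditioned_rep, the top_m parameter, the choice to sum attention over instruction tokens, and mean pooling as the aggregation are assumptions made for this example, and the inputs are assumed to be first-layer attention weights (already averaged over heads) and first-layer visual hidden states exported from the target VLM.

```python
import numpy as np


def instruction_conditioned_rep(
    attn: np.ndarray, visual_hidden: np.ndarray, top_m: int
) -> np.ndarray:
    """Aggregate the visual tokens most attended by instruction tokens.

    attn:          (n_instr, n_vis) attention from instruction tokens to
                   visual tokens (assumed head-averaged, first layer).
    visual_hidden: (n_vis, d) hidden states of the visual tokens.
    top_m:         number of visual tokens to keep (hypothetical knob).
    """
    # Total attention mass each visual token receives from the instruction.
    received = attn.sum(axis=0)
    top_m = min(top_m, received.shape[0])
    # Indices of the top_m most-attended visual tokens.
    top_idx = np.argsort(-received)[:top_m]
    # Assumption: mean pooling yields the compact sample representation.
    return visual_hidden[top_idx].mean(axis=0)


if __name__ == "__main__":
    attn = np.array([[0.1, 0.7, 0.2],
                     [0.0, 0.5, 0.5]])
    hidden = np.array([[1.0, 0.0],
                       [0.0, 2.0],
                       [4.0, 4.0]])
    print(instruction_conditioned_rep(attn, hidden, top_m=2))
```

Stacking one such vector per sample gives the representation matrix from which the importance scores in Figure 2 would be computed.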