Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning

Bardia Safaei, Faizan Siddiqui, Jiacong Xu, Vishal M. Patel, Shao-Yuan Lo

TL;DR

Pre-Instruction Data Selection (PreSel) is a more practical data selection paradigm that first selects the most beneficial unlabeled images and then generates instructions only for those images. With instructions generated for only 15% of the images, it achieves performance comparable to full-data VIT on the LLaVA-1.5 and Vision-Flan datasets.

Abstract

Visual instruction tuning (VIT) for large vision-language models (LVLMs) requires training on expansive datasets of image-instruction pairs, which can be costly. Recent efforts in VIT data selection aim to select a small subset of high-quality image-instruction pairs, reducing VIT runtime while maintaining performance comparable to full-scale training. However, an often-overlooked challenge is that generating instructions from unlabeled images for VIT is itself highly expensive. Most existing VIT datasets rely heavily on human annotations or paid services like the GPT API, which prevents users with constrained resources from creating VIT datasets for custom applications. To address this, we introduce Pre-Instruction Data Selection (PreSel), a more practical data selection paradigm that directly selects the most beneficial unlabeled images and generates instructions only for the selected images. PreSel first estimates the relative importance of each vision task within VIT datasets to derive task-wise sampling budgets. It then clusters image features within each task and selects the most representative images within that task's budget. This approach reduces computational overhead for both instruction generation during VIT data formation and LVLM fine-tuning. By generating instructions for only 15% of the images, PreSel achieves performance comparable to full-data VIT on the LLaVA-1.5 and Vision-Flan datasets. Project page: https://bardisafa.github.io/PreSel
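
The task-wise sampling budgets mentioned in the abstract can be illustrated with a minimal sketch. It assumes the per-task importance weights have already been estimated (how they are estimated is not shown here), and the task names, the weight values, the pool size, and the function name `task_budgets` are all hypothetical. The rounding rule (largest remainder) is likewise an assumption chosen so the per-task counts sum exactly to the overall budget; the paper may allocate differently.

```python
import math

def task_budgets(weights, total_images, ratio=0.15):
    """Split an overall selection budget across tasks in proportion to weights.

    weights: dict mapping task name -> non-negative importance weight
             (hypothetical values; not taken from the paper).
    total_images: size of the unlabeled image pool.
    ratio: fraction of the pool to select (15% in the abstract).
    """
    budget = round(total_images * ratio)
    norm = sum(weights.values())
    # Ideal (fractional) share of the budget for each task.
    shares = {task: budget * w / norm for task, w in weights.items()}
    counts = {task: math.floor(s) for task, s in shares.items()}
    # Largest-remainder rounding: hand leftover slots to the tasks
    # whose fractional share was cut the most by flooring.
    leftover = budget - sum(counts.values())
    by_remainder = sorted(shares, key=lambda t: shares[t] - counts[t], reverse=True)
    for task in by_remainder[:leftover]:
        counts[task] += 1
    return counts

# Hypothetical task weights for a pool of 1000 unlabeled images.
weights = {"vqa": 0.5, "captioning": 0.3, "ocr": 0.2}
budgets = task_budgets(weights, total_images=1000)
print(budgets)  # {'vqa': 75, 'captioning': 45, 'ocr': 30}
```

Instructions would then be generated only for the images chosen within each task's budget, which is where the savings in annotation cost come from.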

Paper Structure

This paper contains 20 sections, 8 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Top: Existing VIT data selection methods assume access to well-prepared VIT datasets in which all the images are already annotated with instructions by costly resources, such as the GPT API and human labor. These methods require information on both images and their instructions. Bottom: Our approach performs selection directly on unlabeled images and then utilizes resources to generate instructions exclusively for the selected images. Hence, we not only enable faster fine-tuning but also significantly reduce instruction generation costs (e.g., by generating instructions for only 15% of the images).
  • Figure 2: An illustration of $Q$, $R$, and $I$ in a VIT sample. Instruction $Y = \{Q, R\}$.
  • Figure 3: We propose $\texttt{PreSel}$, an efficient Pre-Instruction Data Selection approach for Visual Instruction Tuning (VIT). Given a large pool of unlabeled images $D$ from various tasks, $\texttt{PreSel}$ first estimates the importance of each task $T_i$ via a small, randomly selected reference set $\mathcal{D}_{ref}$ for which instructions are generated. Each instruction ($Y$) is split into questions ($Q$) and responses ($R$) to compute the Instruction Relevance Score (IRS), which determines the task proportions $w(T_i)$ in the final selected subset $\mathcal{D}_S$. Given the derived task proportions, it then uses the DINOv2 vision encoder to extract features from the remaining unlabeled images, performs clustering within each task, and selects representative images using the Neighbor Centrality (NC) score. The selected images from all tasks are assembled into $\mathcal{D}_S$.
  • Figure 4: Average relative performance of data selection methods on LLaVA-1.5 at different sampling ratios.
  • Figure 5: A demonstration of task proportions for the LLaVA-1.5 dataset assigned by PreSel and Size-Balanced sampling.
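
The selection stage described in the Figure 3 caption (cluster image features within a task, then pick representative images per cluster) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the exact definition of the Neighbor Centrality (NC) score is not given in this excerpt, so the sketch substitutes a stand-in score, the mean cosine similarity of an image to its k most similar images in the same cluster. The feature vectors and cluster labels are toy inputs; in practice they would come from a vision encoder such as DINOv2 and a clustering routine, neither of which is shown. All function names are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centrality_scores(features, k=2):
    """Stand-in centrality: mean cosine similarity to the k most similar peers.

    Assumption: this approximates the idea of a neighbor-based centrality
    score; it is not the paper's NC formula.
    """
    scores = []
    for i, f in enumerate(features):
        sims = sorted((cosine(f, g) for j, g in enumerate(features) if j != i),
                      reverse=True)
        top = sims[:k]
        scores.append(sum(top) / len(top) if top else 0.0)
    return scores

def select_representatives(features, labels, per_cluster=1):
    """Return indices of the highest-scoring images inside each cluster."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    chosen = []
    for _, members in sorted(clusters.items()):
        scores = centrality_scores([features[i] for i in members])
        ranked = sorted(zip(scores, members), key=lambda pair: -pair[0])
        chosen.extend(idx for _, idx in ranked[:per_cluster])
    return sorted(chosen)

# Toy 2-D "features" for six images of one task, already split into two clusters.
features = [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5),
            (0.0, 1.0), (0.1, 0.9), (0.6, 0.4)]
labels = [0, 0, 0, 1, 1, 1]
selected = select_representatives(features, labels, per_cluster=1)
print(selected)  # [1, 4]
```

In each toy cluster the image lying between its two peers scores highest, which matches the intuition of choosing images that are typical of their neighborhood rather than outliers.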