Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning

Mengyao Lyu, Yan Li, Huasong Zhong, Wenhao Yang, Hui Chen, Jungong Han, Guiguang Ding, Zhenheng Yang

TL;DR

This work tackles data selection for instruction tuning of multi-modal LLMs by introducing mmSSR, a pipeline that decomposes data quality into 14 vision-language capabilities, each rated by a scorer, and uses a multi-modal styler to ensure diversity. The scorers and the lightweight style model are trained on GPT-4o judgments, enabling efficient, scalable scoring and sample selection without embedding-based clustering. Across 14 benchmarks and multiple data budgets, mmSSR consistently outperforms baselines and transfers well to new domains and model architectures, approaching full-data performance with a fraction of the data (e.g., 99.1% of full performance with 30% of the 2.6M data). The approach is of practical value for building robust open-source MLLMs, since it provides customizable, scalable data curation that improves generalization and domain adaptability.

Abstract

The hypothesis that pretrained large language models (LLMs) necessitate only minimal supervision during the supervised fine-tuning (SFT) stage (Zhou et al., 2024) has been substantiated by recent advancements in data curation and selection research. However, the stability and generalizability of these selection methods are compromised by their sensitivity to experimental setups and validation protocols, and they fall short of surpassing random sampling (Diddee & Ippolito, 2024; Xia et al., 2024b). Multi-modal LLMs (MLLMs), built upon LLMs and combined with the sheer token volume and heightened heterogeneity of their data sources, amplify both the significance and the complexity of data selection. To harvest multi-modal instructional data in a robust and efficient manner, we re-define the granularity of the quality metric by decomposing it into 14 vision-language-related capabilities, and introduce multi-modal rich scorers to evaluate the capabilities of each data candidate. To promote diversity, in light of the inherent objective of the alignment stage, we take interaction style as the diversity indicator and use a multi-modal rich styler to identify data instruction patterns. In doing so, our multi-modal rich scorers and styler (mmSSR) guarantee that high-scoring information is conveyed to users in diversified forms. Free from embedding-based clustering or greedy sampling, mmSSR efficiently scales to millions of samples under varying budget constraints, supports customization for general or specific capability acquisition, and facilitates training-free generalization to new domains for curation. Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods, achieving 99.1% of full performance with only 30% of the 2.6M data.
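To make the selection idea concrete, here is a minimal sketch of how per-capability scores and style labels could be combined under a budget. It is an illustration only, not the authors' implementation: the abstract does not state the exact sampling rule, so the round-robin over (capability, style) pairs is an assumption, as are the names `Sample`, `select_subset`, and the example capability and style labels.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    style: str    # interaction style label (hypothetical values, e.g. "qa")
    scores: dict  # capability name -> score on the 0-5 scale


def select_subset(samples, capabilities, budget):
    """Pick up to `budget` samples, cycling over capabilities and styles.

    Assumed rule (not from the paper): every (capability, style) pair
    contributes its highest-scoring unselected sample in turn.
    """
    # Group once by style, then rank each group separately per capability.
    by_style = defaultdict(list)
    for s in samples:
        by_style[s.style].append(s)

    queues = {}
    for cap in capabilities:
        for style, group in by_style.items():
            queues[(cap, style)] = sorted(
                group, key=lambda s: s.scores.get(cap, 0.0), reverse=True
            )

    chosen, seen = [], set()
    cursors = {key: 0 for key in queues}
    progress = True
    while len(chosen) < budget and progress:
        progress = False
        for key, queue in queues.items():
            i = cursors[key]
            # Skip samples already taken through another capability.
            while i < len(queue) and queue[i].sample_id in seen:
                i += 1
            if i < len(queue):
                chosen.append(queue[i])
                seen.add(queue[i].sample_id)
                i += 1
                progress = True
            cursors[key] = i
            if len(chosen) == budget:
                break
    return chosen
```

With a budget expressed as a fraction of the pool (for example 30%), `budget` would be `int(0.3 * len(samples))`. No embeddings or clustering are involved in this sketch; it only sorts precomputed scores, which is in keeping with the abstract's description of mmSSR as free from embedding-based clustering.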

Paper Structure

This paper contains 38 sections, 1 equation, 26 figures, 3 tables.

Figures (26)

  • Figure 1: Our proposed mmSSR against the random sampling baseline (3 trials) across both general and specialized multi-modal benchmarks under the 10% (L) and 30% (R) data budgets.
  • Figure 2: Pipeline of the proposed multi-modal data selection method. We decompose the VL capabilities required by MLLMs and collect GPT-4o's judgments of these rich capabilities on a scale from 0 to 5, while also prompting it to identify the user-model interaction style. The small set of derived sample-score-style triplets is used to instruction-tune the pretrained task model into the multi-modal rich scorers and styler, i.e., our mmSSR. mmSSR enables analysis and sampling of candidate data points at the scale of millions, ensuring a subset that is both high-quality and diverse while keeping time and resource expenditure minimal. The fine-grained mmSSR also generalizes directly to other data domains and supports efficient scaling in data quantity and capabilities.
  • Figure 3: Transferability of mmSSR models: trained on Share-GPT4v data, then run for inference directly on the large-scale LLaVA-OVSI pool.
  • Figure 4: Transferability of mmSSR subsets: selected by mmSSR-7B, directly used to train a LLaVA-OVSI-0.5B variant.
  • Figure 5: Results of scaling in data quantity (1% $\rightarrow$ 50%) and in data capability (basic 13 capabilities in mmSSR + new OCR samples).
  • ...and 21 more figures