MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space

Yicheng Chen, Yining Li, Kai Hu, Zerun Ma, Haochen Ye, Kai Chen

TL;DR

This work addresses the challenge of selecting high-quality, diverse data for instruction tuning by proposing an information-based measure of a dataset in semantic space, modeled as a label graph. It introduces MIG, a greedy sampler over a submodular objective that propagates per-example quality across label relationships and balances quality with diversity through a concave, monotonically increasing transformation. Empirical results across multiple data pools and base models show that MIG consistently outperforms strong baselines with notable efficiency gains; notably, a model fine-tuned on 5% of the Tulu3 pool selected by MIG performs comparably to the official SFT model trained on the full dataset. The approach offers a scalable, generalizable framework for dataset measurement and sampling that can improve instruction-following capabilities with less data and compute.

Abstract

Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.
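
The iterative selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each sample carries a binary label vector and a scalar quality score, that the label graph is given as a non-negative adjacency matrix, and that dataset information is the sum over labels of a concave power function applied to the quality mass propagated through the graph. The names `information` and `greedy_select`, the exponent `power=0.5`, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def information(label_mass, adjacency, power=0.5):
    """Assumed information measure: propagate per-label quality mass over the
    label graph, then apply a concave, monotonically increasing transform."""
    propagated = adjacency @ label_mass
    return float(np.sum(propagated ** power))


def greedy_select(label_vectors, quality, adjacency, budget, power=0.5):
    """Greedily pick the sample with the largest marginal information gain."""
    n_samples, n_labels = label_vectors.shape
    # Each sample contributes its quality score to every label it carries.
    contributions = label_vectors * quality[:, None]
    mass = np.zeros(n_labels)
    current = 0.0
    remaining = set(range(n_samples))
    selected = []
    for _ in range(min(budget, n_samples)):
        best_idx, best_gain = None, -np.inf
        for i in sorted(remaining):
            gain = information(mass + contributions[i], adjacency, power) - current
            if gain > best_gain:
                best_idx, best_gain = i, gain
        selected.append(best_idx)
        remaining.remove(best_idx)
        mass = mass + contributions[best_idx]
        current = information(mass, adjacency, power)
    return selected


# Toy pool (hypothetical): samples 0 and 1 share a label, samples 2 and 3 do not.
labels = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
scores = np.array([1.0, 0.9, 0.5, 0.4])
graph = np.eye(3)  # identity adjacency: no propagation between labels
print(greedy_select(labels, scores, graph, budget=3))  # prints [0, 2, 3]
```

In this toy pool the second-highest-quality sample is skipped because its only label is already covered, which is the quality-versus-diversity trade-off the concave transform is meant to produce. The sketch recomputes the exact marginal gain for every candidate at every step, so it is suitable only for small pools.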

Paper Structure

This paper contains 21 sections, 19 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison with different data selection methods [lu2024instag, liu2024what] on the Tulu3 [lambert2024tulu3] pool using Llama3.1-8B [touvron2023llama], evaluated on (black) knowledge-based benchmarks and (red) human-preference benchmarks. See details in the main results section.
  • Figure 2: Illustration of (a) Data Selection Pipeline and (b) MIG Sampler. Given the raw data pool, our pipeline first applies a tagger and scorer to annotate data. Next, MIG constructs the label graph based on the label set and iteratively selects the data point that maximizes the information gain within the graph. The selected data are used for supervised fine-tuning (SFT) of LLMs.
  • Figure 3: Data scaling experiments on Tulu3 using Llama3.1-8B. The score reported here is the $\text{Avg}$ score.
  • Figure 4: (a) Derivative of Information Score Functions. (b) $\text{Avg}_{\text{obj}}$ on Different Information Score Functions. (c) $\text{Avg}_{\text{sub}}$ on Different Quality Scores.
  • Figure 5: Quantitative results on different quality metrics. DEITA scores achieve the best performance on both human-preference and knowledge-based evaluations.
  • ...and 1 more figure

Theorems & Definitions (1)

  • proof