Diversity Measurement and Subset Selection for Instruction Tuning Datasets

Peiqi Wang, Yikang Shen, Zhen Guo, Matthew Stallone, Yoon Kim, Polina Golland, Rameswar Panda

TL;DR

This work introduces a determinantal point process (DPP) framework for selecting instruction tuning data subsets that accounts for both diversity and data quality. It defines the log determinant distance (LDD) to quantify dataset diversity and shows that LDD correlates with downstream instruction-following performance when computed on weight-gradient representations projected via Johnson-Lindenstrauss transforms. The approach combines a kernelized DPP with greedy MAP inference for scalable subset selection and gives practical guidance on when diversity or quality should dominate the data budget. Empirically, the method shows gains on several instruction tuning datasets and offers insights into curation strategies, including how more diverse sources and higher-quality prompts affect model behavior. The results support a data-efficient pathway for instruction tuning and supply a framework for comparing and analyzing instruction tuning and preference datasets.

Abstract

We aim to select data subsets for fine-tuning large language models so that they follow instructions more effectively. Prior work has emphasized the importance of diversity in dataset curation but relied on heuristics such as the number of tasks. In this paper, we use determinantal point processes to capture the diversity and quality of instruction tuning datasets for subset selection. We propose to measure dataset diversity with the log determinant distance, defined as the distance between the dataset of interest and a maximally diverse reference dataset. Our experiments demonstrate that the proposed diversity measure, computed in the normalized weight gradient space, is correlated with downstream instruction-following performance. Consequently, it can be used to identify when data selection is most helpful and to analyze dataset curation strategies. We demonstrate the utility of our approach on various instruction tuning datasets.
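
To make the subset selection step concrete, the following is a minimal sketch of greedy MAP inference for a DPP using an incremental Cholesky update. It is an illustration under stated assumptions, not the authors' implementation: the function name `greedy_map_dpp` is hypothetical, the kernel is assumed to be a plain Gram matrix of unit-norm feature vectors, and the paper's exact kernel and quality weighting are not shown in this listing.

```python
import numpy as np

def greedy_map_dpp(L, k):
    """Greedily pick up to k items that approximately maximize log det(L_S).

    L is a symmetric positive semidefinite kernel matrix (assumed input).
    Returns the selected indices and the marginal gain of each pick.
    """
    n = L.shape[0]
    C = np.zeros((k, n))                    # rows of an incremental Cholesky factor
    d2 = np.diag(L).astype(float).copy()    # conditional variance of each item
    selected, gains = [], []
    for t in range(k):
        j = int(np.argmax(d2))
        if d2[j] <= 1e-12:                  # remaining items add no volume
            break
        selected.append(j)
        gains.append(np.log(d2[j]))         # marginal gain in log det
        # Update every item's conditional variance given the new pick.
        e = (L[j, :] - C[:t, j] @ C[:t, :]) / np.sqrt(d2[j])
        C[t, :] = e
        d2 = d2 - e ** 2
        d2[selected] = -np.inf              # never reselect an item
    return selected, np.array(gains)

# Toy example: rows 0 and 1 are duplicates, row 2 is orthogonal to both.
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
selected, gains = greedy_map_dpp(X @ X.T, k=2)
print(selected)  # [0, 2]: the duplicate row 1 is skipped
```

The sum of the returned marginal gains equals the log determinant of the kernel restricted to the selected items, which is why the same greedy pass can serve both selection and diversity measurement.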

Paper Structure (24 sections, 1 theorem, 15 equations, 6 figures, 2 tables)

Key Result

Lemma 3.1

Let $\epsilon,\delta>0$. If $r = \mathcal{O}(\log(1/\delta)/\epsilon^2)$, then with probability at least $1-\delta$.
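
The conclusion of the lemma is cut off in this listing, so the sketch below does not attempt to reproduce it. It only illustrates the general Johnson-Lindenstrauss property that the summary relies on: a Gaussian random projection to $r$ dimensions approximately preserves vector norms. The function names, the synthetic "gradient" matrix, and the dimensions are assumptions chosen for illustration.

```python
import numpy as np

def jl_project(G, r, seed=0):
    """Project the rows of G (n x d) to r dimensions with a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    # Scaling by 1/sqrt(r) keeps squared norms unchanged in expectation.
    P = rng.standard_normal((G.shape[1], r)) / np.sqrt(r)
    return G @ P

def normalize_rows(Z):
    """Scale each row to unit Euclidean norm."""
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

# Synthetic stand-in for per-example weight gradients (hypothetical data).
rng = np.random.default_rng(1)
G = rng.standard_normal((20, 2000))
Z = jl_project(G, r=512)

# Ratio of projected norm to original norm, one value per example.
ratio = np.linalg.norm(Z, axis=1) / np.linalg.norm(G, axis=1)
print(ratio.min(), ratio.max())  # both should be close to 1

# Unit-normalized projections, in the spirit of the "normalized weight
# gradient space" mentioned in the abstract.
Z_unit = normalize_rows(Z)
```

A larger `r` tightens the approximation at the cost of more memory and compute per example, which is the trade-off the $\mathcal{O}(\log(1/\delta)/\epsilon^2)$ dimension bound expresses.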

Figures (6)

  • Figure 1: Step-by-step demonstration of computing the log determinant distance on a set of instruction tuning datasets of varying diversity. The marginal gain curve $\Delta_n(L)$ is derived from the greedy MAP algorithm for DPPs (1st figure). $\log\det(L_n)$ is the cumulative marginal gain curve (2nd figure). Note that scaling the kernel matrix $L$ by a constant $c>0$ shifts $\log\det(L_n)$ by $n\log(c)$, which complicates interpretation. Moreover, $\log\det(L_n)$ is heavily influenced by dataset size: Dolly contains 15k examples and has a much larger $\log\det(L)$ than the other datasets, which are subsampled to roughly 50k examples, despite having a comparatively smaller $\log\det(L_{\text{15k}})$. To address these challenges, we compute the difference between the log determinant of a maximally diverse "reference" dataset and that of each dataset of interest (3rd figure) and then divide by dataset size (4th figure). The log determinant distance for a dataset is the value of the corresponding curve $(1/n)\log(\det(R_n)/\det(L_n))$ at the last iteration.
  • Figure 2: The log determinant distance in \ref{eq:ldd_definition} of instruction tuning datasets is correlated with instruction-following performance when the model is fine-tuned on these datasets, with a Pearson correlation of $\rho_{p}=-0.85$ and a Spearman's rank correlation of $\rho_{s}=-0.85$.
  • Figure 3: Studies of the log determinant distance as a measure of diversity of instruction tuning and preference learning datasets. More diverse datasets yield a log determinant distance curve that is closer to a horizontal line and closer to zero. Distilling responses from capable large language models improves diversity (1st panel). The diversity of synthetic datasets generated using Self-Instruct (Wang et al., 2023) increases with a better teacher model (2nd panel). Using LLMs to rewrite instructions to be more complex (Xu et al., 2024) also improves diversity (3rd panel). Curating instructions from diverse sources like ShareGPT & OASST2 yields consistently higher average marginal gains than curating them with less human involvement (4th panel). Preference datasets are overall much more diverse than instruction tuning datasets, some with no apparent drop-off in average marginal gains (5th panel).
  • Figure 4: Comparison of the log determinant distance computed using different data representations: MpNet Emb (top row), Llama Emb (middle row), and Llama $\nabla_{\theta}\ell$ with respect to the instruction tuning loss (bottom row). The log determinant distance based on weight gradients provides the strongest correlation with instruction-following performance.
  • Figure 5: Varying $\lambda$ interpolates between enforcing diversity and selecting for quality. Here we use the length-adjusted win rate, computed as AlpacaEval's win rate divided by the average length of the model's generations and multiplied by the average length of the reference model's generations.
  • ...and 1 more figure
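
The procedure described in the Figure 1 caption can be sketched as follows: run greedy MAP on the dataset kernel and on a reference kernel, accumulate the marginal gains into $\log\det$ curves, take their difference, and divide by the number of selected items. This is a minimal sketch rather than the paper's code. The helper names are hypothetical, and the identity matrix is used as the reference kernel purely as an assumption (for kernels with unit diagonal, Hadamard's inequality gives $\det \le 1$ with equality only for mutually orthogonal items); how the paper builds its reference dataset is not stated in this listing.

```python
import numpy as np

def greedy_gains(K, n):
    """Marginal log det gains of the first n greedy MAP picks for kernel K."""
    K = np.asarray(K, dtype=float)
    C = np.zeros((n, K.shape[0]))       # incremental Cholesky rows
    d2 = np.diag(K).copy()              # conditional variances
    gains = []
    for t in range(n):
        j = int(np.argmax(d2))
        if d2[j] <= 0:
            raise ValueError("kernel rank is smaller than n")
        gains.append(np.log(d2[j]))
        e = (K[j] - C[:t, j] @ C[:t]) / np.sqrt(d2[j])
        C[t] = e
        d2 = d2 - e ** 2
        d2[j] = -np.inf                 # never reselect an item
    return np.array(gains)

def log_det_distance(L, R, n):
    """Return (1/n) * log(det(R_n) / det(L_n)) and the full curve over n."""
    gap = np.cumsum(greedy_gains(R, n)) - np.cumsum(greedy_gains(L, n))
    curve = gap / np.arange(1, n + 1)
    return curve[-1], curve

# Two correlated unit-norm items versus an orthogonal reference (assumed).
L = np.array([[1.0, 0.5], [0.5, 1.0]])
R = np.eye(2)
ldd, curve = log_det_distance(L, R, n=2)
print(ldd)  # equals 0.5 * log(1 / 0.75)

# Scaling both kernels by the same constant leaves the distance unchanged,
# because the n * log(c) shifts cancel in the difference.
ldd_scaled, _ = log_det_distance(3.0 * L, 3.0 * R, n=2)
```

A dataset whose kernel matches the reference yields a distance of zero, and less diverse datasets yield larger positive values.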

Theorems & Definitions (1)

  • Lemma 3.1