Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee

TL;DR

This work tackles the problem of selecting an optimal, diverse subset of instruction data for finetuning large language models. It introduces a diversity-centric framework combining k-means clustering with an iterative refinement loop that leverages early training signals to reweight clusters and resample data. The static k-means-quality (kMQ) method and the iterative variant consistently outperform random sampling and previous data-selection baselines, with the iterative approach achieving the strongest gains across multiple tasks and models. The study demonstrates the practical value of diversity-first data selection for instruction tuning, offering guidelines for cluster count and scoring methods, and provides code to facilitate reproducibility and further research.

Abstract

Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes increasingly important. This work addresses the question: How can we determine the optimal subset of data for effective training? While existing research often emphasizes local criteria like instance quality for subset selection, we argue that a global approach focused on data diversity is more critical. Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, reassessing each cluster's importance and sampling weight in every training iteration. This approach reduces the effect of outliers and automatically filters out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is available at https://github.com/for-ai/iterative-data-selection.
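The iterative refinement described in the abstract can be illustrated with a minimal sketch. It assumes cluster labels are already available (for example from k-means over prompt embeddings) and uses a hypothetical `score_fn` callback as a stand-in for whatever training signal rates the sampled instances. The function name, the multiplicative weight update, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def iterative_cluster_sampling(labels, score_fn, budget, n_iters=3, seed=0):
    """Pick about `budget` indices over `n_iters` rounds, reweighting clusters each round.

    labels   : cluster id per instance (e.g. from k-means), ints in [0, k)
    score_fn : hypothetical callback mapping an array of instance indices
               to a mean quality score in [0, 1]
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    k = int(labels.max()) + 1
    sizes = np.bincount(labels, minlength=k)
    weights = sizes / sizes.sum()          # start proportional to cluster size
    selected = np.zeros(len(labels), dtype=bool)
    per_round = budget // n_iters          # any remainder is left unused

    for _ in range(n_iters):
        pool = ~selected
        remaining = np.bincount(labels[pool], minlength=k)
        # Spread each cluster's weight evenly over its not-yet-selected members.
        per_member = np.divide(weights, remaining,
                               out=np.zeros(k), where=remaining > 0)
        p = per_member[labels] * pool
        n_draw = min(per_round, int(np.count_nonzero(p)))
        if n_draw == 0:
            break
        picked = rng.choice(len(labels), size=n_draw, replace=False, p=p / p.sum())
        selected[picked] = True

        # Reweight: scale each sampled cluster by the score of its new picks
        # (an assumed update rule, chosen for simplicity).
        for c in np.unique(labels[picked]):
            weights[c] *= score_fn(picked[labels[picked] == c])
        if weights.sum() == 0:
            break
        weights = weights / weights.sum()

    return np.flatnonzero(selected), weights

# Toy data: three equal clusters, with cluster 2 standing in for low-quality data.
labels = np.repeat([0, 1, 2], 100)
quality = np.where(labels == 2, 0.0, 1.0)
chosen, weights = iterative_cluster_sampling(
    labels, lambda idx: float(quality[idx].mean()), budget=90)
```

In this toy setup, a cluster whose sampled instances score zero loses its sampling weight in the round after it is first drawn, which mirrors the abstract's claim that low-quality clusters are filtered out automatically. A softer update (for example, a moving average of scores) would down-weight such clusters gradually instead.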

Paper Structure (22 sections, 8 equations, 6 figures, 7 tables, 1 algorithm)

Figures (6)

  • Figure 1: Our proposed clustering ($k$MQ) and two sampling methods. We visualize our static data selection with $k$MQ, as proposed in \ref{sec:static}, and the iterative data selection pipeline, where we refine the selection criteria and resample new instances in each iteration, as proposed in \ref{sec:iterative_method}.
  • Figure 2: Comparison of iterative selection approach using different sample-scoring methods: perplexity, GPT-4, reward model. Note that both random and $k$MQ selection methods use 10% of data and train for three epochs. The iterative feedback runs are performed with the same budget at iteration 3, ensuring a fair comparison. Iterative sampling using a reward model achieves the best performance.
  • Figure 3: Average performance on downstream tasks (bar plots) for different numbers of clusters $k$. Downstream performance correlates with both the Silhouette and Elbow scores. The Silhouette score is an efficient and effective proxy for estimating the number of clusters, eliminating the need to explore the hyperparameter space.
  • Figure 4: The percentage of clusters with an aggregated quality score below the threshold of 0.3.
  • Figure 5: Impact of using different embedding models to cluster prompts. The Silhouette score consistently predicts the overall cluster quality with different embedding models.
  • ...and 1 more figures
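
The Figure 3 caption describes the Silhouette score as a proxy for choosing the number of clusters $k$. The sketch below shows what that score computes and how it can rank candidate clusterings. The helper `mean_silhouette`, the toy points, and the hand-written labelings (standing in for k-means outputs at different $k$) are all illustrative assumptions rather than part of the paper's code.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette over all points (Euclidean distance).

    Assumes at least two clusters and that not all points coincide.
    Points in singleton clusters get a score of 0 by the usual convention.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Toy points forming two visibly separated groups.
X = [[0, 0], [0, 1], [1, 0], [10, 0], [10, 1], [11, 0]]
candidates = {
    2: [0, 0, 0, 1, 1, 1],  # matches the two groups
    3: [0, 0, 1, 2, 2, 2],  # splits the first group unnecessarily
}
scores = {k: mean_silhouette(X, lab) for k, lab in candidates.items()}
best_k = max(scores, key=scores.get)
```

Picking the $k$ with the highest mean silhouette requires only the embeddings and the cluster assignments, so no model has to be finetuned for each candidate $k$. In practice the labelings would come from running k-means once per candidate value.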