
Data Diversity Matters for Robust Instruction Tuning

Alexander Bukharin, Shiyang Li, Zhengyang Wang, Jingfeng Yang, Bing Yin, Xian Li, Chao Zhang, Tuo Zhao, Haoming Jiang

TL;DR

This paper tackles the challenge of automatically curating high-quality and diverse instruction tuning data. It introduces Quality-Diversity Instruction Tuning (QDIT), which jointly optimizes a diversity measure based on a facility location function and a quality score from LLMs or reward models via a greedy selection strategy. Key findings reveal a tradeoff between quality and diversity and demonstrate that increasing diversity substantially enhances worst-case robustness while preserving average performance. Across five large-scale datasets, QDIT consistently improves both average and worst-case instruction following, offering a practical approach to robust instruction tuning with scalable data selection.

Abstract

Recent works have shown that by curating high-quality and diverse instruction tuning datasets, we can significantly improve instruction-following capabilities. However, creating such datasets is difficult, and most works rely on manual curation or proprietary language models. Automatic data curation is difficult because it is still not clear how we can define diversity for instruction tuning, how diversity and quality depend on one another, and how we can optimize dataset quality and diversity. To resolve these issues, we propose a new algorithm, Quality-Diversity Instruction Tuning (QDIT). QDIT provides a simple method to simultaneously control dataset diversity and quality, allowing us to conduct an in-depth study on the effect of diversity and quality on instruction tuning performance. From this study we draw two key insights: (1) there is a natural tradeoff between data diversity and quality, and (2) increasing data diversity significantly improves worst-case instruction-following performance, thereby improving robustness. We validate the performance of QDIT on several large-scale instruction tuning datasets, where we find it can substantially improve worst-case and average-case performance compared to quality-driven data selection.
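
To make the selection idea concrete, the following is a minimal sketch of greedy quality-diversity selection, not the authors' implementation. It assumes diversity is measured by a facility location function over cosine similarities of instruction embeddings, and that a weight `alpha` blends the diversity gain with a per-example quality score so that `alpha = 1.0` reduces to purely quality-driven selection (as the Figure 4 caption indicates). The function name `greedy_quality_diversity_select`, the embedding inputs, and the unnormalized score scales are all illustrative assumptions.

```python
import numpy as np


def greedy_quality_diversity_select(embeddings, quality, k, alpha):
    """Greedily pick k indices, trading off diversity and quality.

    Illustrative sketch only. Assumptions (not taken from the source):
    cosine similarity clipped at 0 as the similarity measure, and raw
    (unnormalized) diversity gains and quality scores.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    quality = np.asarray(quality, dtype=float)

    # Pairwise cosine similarity, clipped so that all values are non-negative.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = np.clip(normed @ normed.T, 0.0, None)

    n = sim.shape[0]
    # coverage[v] = max similarity between point v and any selected point.
    coverage = np.zeros(n)
    available = np.ones(n, dtype=bool)
    selected = []

    for _ in range(min(k, n)):
        # Marginal facility location gain of adding candidate a:
        # sum over all points v of max(sim[v, a] - coverage[v], 0).
        diversity_gain = np.maximum(sim - coverage[:, None], 0.0).sum(axis=0)
        score = (1.0 - alpha) * diversity_gain + alpha * quality
        score[~available] = -np.inf  # never re-select a chosen point

        best = int(np.argmax(score))
        selected.append(best)
        available[best] = False
        coverage = np.maximum(coverage, sim[:, best])

    return selected


if __name__ == "__main__":
    # Two tight clusters; the first cluster has the higher quality scores.
    emb = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
    qual = [0.9, 0.8, 0.1, 0.2]
    print("alpha=1.0:", greedy_quality_diversity_select(emb, qual, 2, 1.0))
    print("alpha=0.0:", greedy_quality_diversity_select(emb, qual, 2, 0.0))
```

In this toy example, `alpha = 1.0` selects the two highest-quality points, which both come from the same cluster, while `alpha = 0.0` selects one point from each cluster. In practice the diversity gain and the quality score would likely need to be put on comparable scales before blending; this sketch leaves them unnormalized.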

Paper Structure (19 sections, 2 equations, 18 figures, 5 tables, 1 algorithm)


Figures (18)

  • Figure 1: Distribution of root verbs and first nouns selected by different algorithms. The dataset size is 3000.
  • Figure 2: Effect of $\alpha$ on QDIT's dataset quality and diversity. The red line represents a randomly selected dataset.
  • Figure 3: Percent improvement in HH Score of QDIT over Quality-based selection. Performance is averaged over the five datasets and the shaded area represents one standard deviation.
  • Figure 4: Effect of $\alpha$ on best case and worst case performance. The red line represents a randomly selected dataset and $\alpha=1.0$ is quality-driven data selection. Worst HH Score refers to the bottom 10 percent of HH scores.
  • Figure 5: The selection strategy of the various QDIT algorithms on an example dataset. Six data points are selected and the selected data points are circled.
  • ...and 13 more figures