Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning

Shengguang Wu, Keming Lu, Benfeng Xu, Junyang Lin, Qi Su, Chang Zhou

TL;DR

The paper addresses the data- and cost-intensive nature of instruction tuning for large language models by introducing DiverseEvol, a self-evolving, diversity-driven sampling method. It uses an iterative framework and K-Center sampling in the model’s embedding space to select highly diverse subsets of instructions, so that models can be trained on less than 8% of the original data without sacrificing performance. Empirical results across three open-source instruction-tuning datasets and benchmarks show that DiverseEvol can match or surpass full-data baselines, and further analyses highlight the roles of data diversity and of iterative selection as opposed to one-time sampling. The code is publicly available.

Abstract

Enhancing the instruction-following ability of Large Language Models (LLMs) primarily demands substantial instruction-tuning datasets. However, the sheer volume of these imposes a considerable computational burden and annotation cost. To investigate a label-efficient instruction tuning method that allows the model itself to actively sample subsets that are equally or even more effective, we introduce a self-evolving mechanism DiverseEvol. In this process, a model iteratively augments its training subset to refine its own performance, without requiring any intervention from humans or more advanced LLMs. The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets, as the model selects new data points most distinct from any existing ones according to its current embedding space. Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol. Our models, trained on less than 8% of the original dataset, maintain or improve performance compared with finetuning on full data. We also provide empirical evidence to analyze the importance of diversity in instruction data and the iterative scheme as opposed to one-time sampling. Our code is publicly available at https://github.com/OFA-Sys/DiverseEvol.git.
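
The selection rule described in the abstract (pick the datapoints most distinct from any already chosen, measured in the model's current embedding space) can be illustrated with a greedy farthest-point procedure. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `k_center_select`, the use of plain Python lists as embeddings, and the choice of Euclidean distance are all illustrative.

```python
import math


def k_center_select(embeddings, pool_idx, k):
    """Greedy farthest-point (K-Center style) selection.

    embeddings: list of equal-length float vectors, one per datapoint.
    pool_idx:   indices already in the training pool (must be non-empty).
    k:          number of new datapoints to pick from the remaining ones.
    Euclidean distance is an assumption made for this sketch.
    """
    pool = set(pool_idx)
    candidates = [i for i in range(len(embeddings)) if i not in pool]
    # Distance from each candidate to its nearest point in the current pool.
    min_dist = {
        i: min(math.dist(embeddings[i], embeddings[j]) for j in pool)
        for i in candidates
    }
    selected = []
    for _ in range(min(k, len(candidates))):
        # Pick the candidate that is farthest from everything chosen so far.
        best = max(min_dist, key=min_dist.get)
        selected.append(best)
        del min_dist[best]
        # The new point may now be the nearest neighbour of other candidates.
        for i in min_dist:
            d = math.dist(embeddings[i], embeddings[best])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy 2-D "embeddings"; index 0 is the initial pool.
points = [[0, 0], [1, 0], [10, 0], [0, 5], [0.1, 0]]
print(k_center_select(points, [0], 2))  # [2, 3]
```

In the iterative setting, the embeddings would be recomputed with the newly trained model before each selection round, so the notion of "distinct" changes as the model evolves.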

Paper Structure

This paper contains 10 sections, 2 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of our iterative DiverseEvol: Starting with an initial training data pool $P_0$ and the remaining data $Q_0$ from the source dataset, we train a chat model $M_0$ and project all datapoints into its embedding space $EMB_0$. Using K-Center-based selection in this embedding space, we choose a new set of datapoints $S_0$ from $Q_0$ and add it to the pool to form the next training data pool $P_1$, which is used to instruction-tune the next chat model $M_1$. This process is repeated for $T$ steps, producing a progressively augmented training data pool based solely on the model itself, which in turn is used to train a progressively more capable model.
  • Figure 2: Performance evolution of chat models across various source datasets using our proposed K-Center-based DiverseEvol and alternative sampling approaches. The Y-axis represents relative scores (RS) with respect to ChatGPT, while the X-axis indicates the number of training samples. The curves show the rapid proficiency gains achieved by the DiverseEvol approach, which matches or often outpaces strong baselines (*Full Data) trained on the full dataset while using only a small fraction of the data.
  • Figure 3: Diversity evolution in the selected training data pool from three source datasets. The Y-axis denotes the Vendi-Score for measuring diversity, and the X-axis shows increasing data size. The gray line (*Full Data) represents the diversity of the original source dataset. The contrasting curves highlight our K-Center approach's early and sustained enhancement of data diversity.
  • Figure 4: Performance of instruction-tuned chat models in relation to Vendi-Score of their training datasets, illustrating the influence of data diversity. The three distinct curves correspond to training data volumes of $300$, $700$, and $1100$. A consistent trend of performance enhancement is observed with increased dataset diversity across most benchmarks, with only minor deviations seen on the Wizardlm-Bench.
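
Figures 3 and 4 rely on the Vendi-Score as the diversity measure. As a rough illustration, the Vendi Score of $n$ items with similarity matrix $K$ is the exponential of the Shannon entropy of the eigenvalues of $K/n$. The sketch below assumes a cosine-similarity kernel over embedding vectors and uses NumPy; the function name `vendi_score` and the kernel choice are assumptions for illustration, and the paper's exact setup may differ.

```python
import numpy as np


def vendi_score(embeddings):
    """Vendi Score: exp of the Shannon entropy of the eigenvalues of K / n.

    embeddings: array-like of shape (n, d) with no all-zero rows.
    A cosine-similarity kernel is assumed here for illustration.
    """
    X = np.asarray(embeddings, dtype=float)
    # Normalise rows so that X @ X.T is a cosine-similarity matrix.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = X @ X.T
    # K / n is symmetric with trace 1, so its eigenvalues sum to 1.
    eigvals = np.linalg.eigvalsh(K / len(X))
    # Drop numerically zero eigenvalues (0 * log 0 is treated as 0).
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Three mutually orthogonal items behave like 3 distinct samples,
# while four identical items behave like a single sample.
print(round(vendi_score(np.eye(3)), 6))        # 3.0
print(round(vendi_score([[1.0, 0.0]] * 4), 6)) # 1.0
```

The score can be read as an effective number of distinct items: it ranges from 1 (all items identical under the kernel) to $n$ (all items mutually dissimilar).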