IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection

Jielin Song, Siyu Liu, Bin Zhu, Yanghui Rao

TL;DR

By fine-tuning on approximately 20% of the source data, this method consistently outperforms models fine-tuned on the full dataset across multiple benchmarks and public test datasets, highlighting the effectiveness of this approach in enhancing LLM performance while reducing the computational resources required for instruction tuning.

Abstract

As large language models (LLMs) continue to advance, instruction tuning has become critical for improving their ability to generate accurate and contextually appropriate responses. Although numerous instruction-tuning datasets have been developed to enhance LLM performance, selecting high-quality instruction data from large source datasets typically demands significant human effort. In this work, we introduce IterSelectTune, an efficient, cost-effective iterative training policy for selecting high-quality instruction data with no human involvement and limited reliance on GPT-4. By fine-tuning on approximately 20% of the source data, our method consistently outperforms models fine-tuned on the full dataset across multiple benchmarks and public test datasets. These results highlight the effectiveness of our approach in enhancing LLM performance while reducing the computational resources required for instruction tuning.

Paper Structure

This paper contains 37 sections, 5 equations, 8 figures, 9 tables, 2 algorithms.

Figures (8)

  • Figure 1: Illustration of our framework. We first apply K-Means clustering to the source set $\mathcal{S}$ to derive the diversity subset $\mathcal{V}$. Subsequently, we compute model scores and similarity scores for $\mathcal{X}_\mathcal{V}$, followed by sorting and selecting a batch $\mathcal{D}$. 1) In the iterative training phase, we input $\mathcal{X}_\mathcal{D}$ into the LLM to generate responses $\hat{\mathcal{Y}}_\mathcal{D}$. GPT-4 then evaluates $\hat{\mathcal{Y}}_\mathcal{D}$ and $\mathcal{Y}_\mathcal{D}$ for binary classification. The resulting binary-classified dataset is employed to train the classifier model, enabling it to assess the quality of instructions. 2) During the inference phase, after obtaining batch $\mathcal{D}$ through score sorting, we directly incorporate it into the hard dataset $\mathcal{D}^\mathcal{H}$.
  • Figure 2: Winning Score vs. Training Data Size: Performance comparison across different test sets (top) and total performance (bottom).
  • Figure 3: Comparison of Win/Tie/Lose for models fine-tuned on 10% (top) and 20% (bottom) of the data, with the full-data fine-tuned model.
  • Figure 4: Score visualization across multiple categories on MT-Bench.
  • Figure 5: Comparison of the number of "hard" instructions identified across iterations for different $\alpha$. Results shown up to iteration 3.
  • ...and 3 more figures
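
The score-sort-select step described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `select_batch`, the `weight` parameter, and the convex combination of the two scores are all assumptions made for illustration, since the caption states only that model scores and similarity scores are computed, sorted, and used to pick a batch $\mathcal{D}$.

```python
from typing import List, Sequence


def select_batch(
    model_scores: Sequence[float],
    similarity_scores: Sequence[float],
    batch_size: int,
    weight: float = 0.5,
) -> List[int]:
    """Return indices of the top `batch_size` candidates by combined score.

    Hypothetical helper: the caption does not say how the two scores
    are combined, so a simple weighted mix is assumed here.
    """
    if len(model_scores) != len(similarity_scores):
        raise ValueError("score lists must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # Assumed combination rule: convex mix of model and similarity scores.
    combined = [
        weight * m + (1.0 - weight) * s
        for m, s in zip(model_scores, similarity_scores)
    ]
    # Rank candidate indices from highest to lowest combined score.
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return ranked[:batch_size]


# Toy scores for four candidate instructions from the diversity subset.
batch = select_batch([0.9, 0.2, 0.6, 0.4], [0.3, 0.8, 0.7, 0.3], batch_size=2)
print(batch)
```

With these toy values the combined scores are roughly 0.60, 0.50, 0.65, and 0.35, so candidates 2 and 0 form the batch. In the iterative phase described by the caption, such a batch would then be passed to the LLM and judged by GPT-4; in the inference phase it would go directly into the hard dataset.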