Boosting LLM via Learning from Data Iteratively and Selectively

Qi Jia, Siyu Ren, Ziheng Qin, Fuzhao Xue, Jinjie Ni, Yang You

TL;DR

This work tackles data quality issues in instruction tuning arising from multi-source data with noise and duplication. It introduces IterIT, a data-selection framework that jointly optimizes a model-specific complexity score $S_{\mathrm{COM}}^{i,\theta}$ and a response-diversity score $S_{\mathrm{DIV}}^{i}$ to iteratively curate high-value instruction-response pairs, updating complexity across epochs. Through extensive experiments on diverse datasets and backbones, IterIT achieves strong average performance and robust generalization, while ablations confirm the value of iterative selection and diversity-aware scoring. The results suggest that model-data collaboration during post-training can yield substantial gains with only a modest data budget, enabling efficient instruction tuning in real-world settings.

Abstract

Modern datasets are generally constructed from multiple sources using different synthetic techniques, which makes data de-noising and de-duplication crucial before the data are used for post-training. In this work, we propose to perform instruction tuning by iterative data selection (IterIT). We measure the quality of a sample by its complexity and diversity simultaneously. Instead of calculating the complexity score once and for all before fine-tuning, we highlight the importance of updating this model-specific score during fine-tuning to accurately track the dynamic changes of the model. The diversity score, in turn, is defined over the samples' responses, taking their informativeness into account. IterIT integrates the strengths of both by iteratively updating the complexity score for the top-ranked samples and greedily selecting those with the highest complexity-diversity score. Experiments on multiple instruction-tuning datasets demonstrate consistent improvements of IterIT over strong baselines. Moreover, our approach generalizes well to domain-specific scenarios and different backbone models. All resources will be available at https://github.com/JiaQiSJTU/IterIT.
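
The selection loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_for_epoch`, the `refresh_factor` parameter, the use of a product to combine the two scores, and the fixed per-sample diversity values are all assumptions made for illustration. `score_complexity` stands in for whatever model-specific complexity measure is computed with the current model parameters.

```python
from typing import Callable, List, Sequence


def select_for_epoch(
    samples: Sequence[str],
    cached_complexity: List[float],
    diversity: Sequence[float],
    score_complexity: Callable[[str], float],
    budget: int,
    refresh_factor: int = 3,
) -> List[int]:
    """Return indices of `budget` samples chosen for the next epoch.

    Hypothetical sketch: complexity is refreshed only for the top-ranked
    candidates, then the best `budget` of them are kept greedily.
    """

    def combined(i: int) -> float:
        # Assumption: the combined score is the product of the two scores.
        return cached_complexity[i] * diversity[i]

    # Rank every sample by its cached (possibly stale) combined score.
    ranked = sorted(range(len(samples)), key=combined, reverse=True)

    # Re-score complexity with the current model for the top candidates only,
    # updating the cache in place so later epochs see the fresh values.
    candidates = ranked[: budget * refresh_factor]
    for i in candidates:
        cached_complexity[i] = score_complexity(samples[i])

    # Greedily keep the candidates with the highest refreshed combined score.
    candidates.sort(key=combined, reverse=True)
    return candidates[:budget]
```

In a training loop, this function would be called once per epoch with `score_complexity` bound to the model as it stands at that point, so that samples the model has already learned lose rank. Refreshing only the top-ranked candidates keeps the number of model evaluations per epoch bounded by `budget * refresh_factor` rather than the full dataset size.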

Paper Structure

This paper contains 26 sections, 8 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustrations of Vanilla, other baselines, and IterIT. Grey boxes represent training data that has not yet been assessed; these samples are ranked by different metrics, shown as the colored boxes. The red arrows in IterIT emphasize the collaboration between the model and the data: the model supervises the data selection process, while the selected samples are used to update the model's parameters.
  • Figure 2: Ablation on the need for iterative selection. Models are evaluated by the average performance (%) over 7 datasets.
  • Figure 3: The response length of instruction-tuning data selected from Alpaca by different approaches.
  • Figure 4: The average scores (%) of models trained on Alpaca under different hyper-parameters.
  • Figure 5: Response lengths of instruction-tuning data selected from Alpaca-GPT4 and WizardLM.
  • ...and 1 more figure