Boosting LLM via Learning from Data Iteratively and Selectively
Qi Jia, Siyu Ren, Ziheng Qin, Fuzhao Xue, Jinjie Ni, Yang You
TL;DR
This work tackles data quality issues in instruction tuning that arise from noisy, duplicated multi-source data. It introduces IterIT, a data-selection framework that combines a model-specific complexity score $S_{\mathrm{COM}}^{i,\theta}$ with a response-diversity score $S_{\mathrm{DIV}}^{i}$ to iteratively curate high-value instruction-response pairs, updating the complexity score across epochs. In extensive experiments on diverse datasets and backbones, IterIT achieves strong average performance and robust generalization, while ablations confirm the value of iterative selection and diversity-aware scoring. The results suggest that model-data collaboration during post-training can yield substantial gains with only a modest data budget, enabling efficient instruction tuning in real-world settings.
Abstract
Modern datasets are typically constructed from multiple sources using a variety of synthesis techniques, which makes de-noising and de-duplication crucial before the data are used for post-training. In this work, we propose to perform instruction tuning by iterative data selection (IterIT). We measure the quality of a sample in terms of complexity and diversity simultaneously. Instead of calculating the complexity score once and for all before fine-tuning, we highlight the importance of updating this model-specific score during fine-tuning so that it tracks the model's changing state. The diversity score, in contrast, is defined over the samples' responses, taking their informativeness into account. IterIT combines the strengths of both by iteratively updating the complexity score for the top-ranked samples and greedily selecting those with the highest combined complexity-diversity score. Experiments on multiple instruction-tuning datasets demonstrate consistent improvements of IterIT over strong baselines. Moreover, our approach generalizes well to domain-specific scenarios and different backbone models. All resources will be available at https://github.com/JiaQiSJTU/IterIT.
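
The selection loop described in the abstract can be illustrated with a minimal Python sketch of one epoch. It is not the authors' implementation: the abstract does not give the scoring formulas, so the product used to combine the two scores, the `rescore` callback (standing in for a model-specific complexity computation), and the `refresh_factor` parameter are all hypothetical choices made here for illustration only.

```python
from typing import Callable, List, Sequence


def select_epoch(
    complexity: List[float],
    diversity: Sequence[float],
    rescore: Callable[[int], float],
    budget: int,
    refresh_factor: int = 2,
) -> List[int]:
    """Refresh complexity for top-ranked samples, then pick `budget` of them.

    Assumptions (not from the paper): scores are combined by a product,
    and the top `budget * refresh_factor` samples are re-scored each epoch.
    """
    n = len(complexity)

    def combined(i: int) -> float:
        # Hypothetical combination of the complexity and diversity scores.
        return complexity[i] * diversity[i]

    # Rank all samples using the (possibly stale) complexity scores.
    ranked = sorted(range(n), key=combined, reverse=True)

    # Re-score only the top-ranked candidates with the current model.
    candidates = ranked[: min(n, budget * refresh_factor)]
    for i in candidates:
        complexity[i] = rescore(i)  # updated in place for later epochs

    # Greedily keep the candidates with the highest refreshed score.
    candidates.sort(key=combined, reverse=True)
    return candidates[:budget]


# Toy usage: sample 0 looked hard at first, but the (hypothetical) current
# model has already learned it, so its refreshed complexity drops to 0.2.
complexity = [0.9, 0.8, 0.1, 0.5, 0.05]
diversity = [1.0, 1.0, 1.0, 2.0, 1.0]
refreshed = {0: 0.2, 1: 0.8, 2: 0.1, 3: 0.5}
selected = select_epoch(complexity, diversity, lambda i: refreshed[i], budget=2)
print(selected)  # [3, 1]
```

In this toy run, sample 4 is never re-scored because it falls outside the top-ranked candidates, which reflects the idea of refreshing the model-specific score only where it is likely to affect the selection.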
