Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization
Xuxi Chen, Zhendong Wang, Daouda Sow, Junjie Yang, Tianlong Chen, Yingbin Liang, Mingyuan Zhou, Zhangyang Wang
TL;DR
The paper tackles data scarcity in continual training of large language models by showing that samples with moderately high losses can be more informative than the highest-loss examples, which tend to be noisy or overly complex. It introduces MidRanking, an empirical loss-based sample-selection strategy, and IR-DRO, a principled instance-reweighting framework grounded in distributionally robust optimization whose closed-form solution allows straightforward integration into existing pipelines. Across multiple models, datasets, and training settings (continual pre-training and instruction tuning), the proposed methods yield consistent performance gains, including notable improvements on MMLU, with modest computational overhead. The work provides practical algorithms and open-source code to facilitate data reweighting and targeted sampling in LLM training workflows.
Abstract
In the rapidly advancing arena of large language models (LLMs), a key challenge is to enhance their capabilities amid a looming shortage of high-quality training data. Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training datasets, with a specific focus on selective retention of samples that incur moderately high losses. These samples are deemed informative and beneficial for model refinement, in contrast to the highest-loss samples, which are discarded because they correlate with data noise and complexity. We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization (IR-DRO). IR-DRO is designed to dynamically prioritize the training focus on informative samples through an instance reweighting mechanism, streamlined by a closed-form solution for straightforward integration into established training protocols. Through rigorous experimentation with various models and datasets, our findings indicate that our sample-targeted methods significantly improve LLM performance across multiple benchmarks, in both continual pre-training and instruction tuning scenarios. Our code is available at https://github.com/VITA-Group/HardFocusTraining.
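As a rough illustration of the two ideas described in the abstract, the sketch below shows one way to select moderately high-loss samples and one way to compute closed-form instance weights from per-sample losses. It is a minimal sketch and not the authors' implementation: the function names, the `skip_top_frac` and `keep_frac` fractions, and the `temperature` parameter are hypothetical, and the softmax-style weights assume a KL-divergence-regularized DRO objective, which the abstract does not specify.

```python
import math


def mid_ranking_select(losses, skip_top_frac=0.1, keep_frac=0.3):
    """Return indices of samples with moderately high loss.

    Samples are ranked by loss in descending order. The top
    `skip_top_frac` are dropped (assumed to be noisy or overly complex),
    and the next `keep_frac` are kept. Both fractions are illustrative
    assumptions, not values taken from the paper.
    """
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    start = int(n * skip_top_frac)
    end = min(n, start + int(n * keep_frac))
    return order[start:end]


def instance_weights(losses, temperature=1.0):
    """Softmax-style weights: w_i proportional to exp(loss_i / temperature).

    This is the closed form obtained under an assumed KL-regularized DRO
    objective; larger temperature values move the weights toward uniform.
    """
    max_loss = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - max_loss) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_loss(losses, temperature=1.0):
    """Weighted sum of per-sample losses using the weights above."""
    weights = instance_weights(losses, temperature)
    return sum(w * loss for w, loss in zip(weights, losses))
```

In a training loop, the per-sample losses would typically come from an unreduced loss computation over a batch, with the weights treated as constants when backpropagating through the weighted sum.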
