
Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization

Xuxi Chen, Zhendong Wang, Daouda Sow, Junjie Yang, Tianlong Chen, Yingbin Liang, Mingyuan Zhou, Zhangyang Wang

TL;DR

The paper tackles data scarcity in continual training of large language models by showing that samples with moderately high losses can be more informative than the highest-loss samples, which tend to reflect data noise and complexity. It introduces MidRanking, an empirical loss-based sample-selection strategy, and IR-DRO, a principled instance-reweighting framework grounded in distributionally robust optimization whose closed-form solution allows straightforward integration into existing training pipelines. Across multiple models, datasets, and tasks (continual pre-training and instruction tuning), the proposed methods yield consistent performance gains with modest computational overhead, including notable improvements on MMLU. The work provides practical algorithms and open-source code for data reweighting and targeted sampling in LLM training workflows.

Abstract

In the rapidly advancing arena of large language models (LLMs), a key challenge is to enhance their capabilities amid a looming shortage of high-quality training data. Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training datasets, with a specific focus on selective retention of samples that incur moderately high losses. These samples are deemed informative and beneficial for model refinement, in contrast to the highest-loss samples, which are discarded due to their correlation with data noise and complexity. We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization (IR-DRO). IR-DRO is designed to dynamically prioritize the training focus on informative samples through an instance reweighting mechanism, streamlined by a closed-form solution for straightforward integration into established training protocols. Through rigorous experimentation with various models and datasets, our findings indicate that our sample-targeted methods significantly improve LLM performance across multiple benchmarks, in both continual pre-training and instruction tuning scenarios. Our code is available at https://github.com/VITA-Group/HardFocusTraining.
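
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `skip_top` and `keep` counts, and the `temperature` parameter are hypothetical. The reweighting function assumes a KL-regularized DRO objective, for which the standard closed-form weights are proportional to `exp(loss / temperature)`; the abstract only states that a closed-form solution exists, so the exact formulation in the paper may differ.

```python
import math
from typing import List, Sequence


def mid_ranking_indices(losses: Sequence[float], skip_top: int, keep: int) -> List[int]:
    """Select samples with moderately high loss (illustrative only).

    skip_top: number of highest-loss samples to drop as likely noise (assumed knob).
    keep:     number of samples retained directly below them (assumed knob).
    """
    # Rank sample indices from highest to lowest loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    # Drop the very top, keep the next `keep` samples, return them in index order.
    return sorted(order[skip_top:skip_top + keep])


def kl_dro_weights(losses: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Closed-form instance weights under an assumed KL-regularized DRO objective.

    w_i is proportional to exp(loss_i / temperature), so higher-loss samples
    receive larger weights. A large temperature approaches uniform weighting.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum loss before exponentiating for numerical stability.
    max_loss = max(losses)
    exps = [math.exp((loss - max_loss) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_loss(losses: Sequence[float], temperature: float = 1.0) -> float:
    """Weighted batch loss; in real training the weights are treated as constants."""
    weights = kl_dro_weights(losses, temperature)
    return sum(w * loss for w, loss in zip(weights, losses))


if __name__ == "__main__":
    batch_losses = [0.1, 0.5, 0.9, 1.3, 1.7, 2.1, 2.5, 2.9, 3.3, 9.0]
    print(mid_ranking_indices(batch_losses, skip_top=1, keep=3))
    print(kl_dro_weights(batch_losses, temperature=2.0))
```

In an autograd framework, the weights would typically be computed from detached per-sample losses so that gradients flow only through the loss terms, not through the weights themselves.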

Paper Structure (18 sections, 9 equations, 4 figures, 5 tables, 1 algorithm)

Figures (4)

  • Figure 1: We introduce IR-DRO, a principled optimization-based framework that automatically decides importance scores at the instance level and reweights the training process accordingly.
  • Figure 2: The distribution of the coefficients generated by IR-DRO during continual pre-training. We visualize both the frequency of the coefficients and their cumulative percentage.
  • Figure 3: Model performance after training with different numbers of batches. We evaluate the OPT-350M model continually trained with IR-DRO on three datasets (ARC-Challenge, BoolQ, and WinoGrande) and report the average score.
  • Figure 4: Model performance after training with different learning rates. We evaluate the OPT-350M model continually trained with IR-DRO on three datasets (ARC-Challenge, BoolQ, and WinoGrande) and report the average score.