A Unified Understanding of Offline Data Selection and Online Self-refining Generation for Post-training LLMs
Quan Xiao, Tianyi Chen
TL;DR
This work studies how to improve LLM post-training by unifying offline data selection and online self-refining generation under a bilevel optimization framework. It introduces bilevel data selection (BDS) and a bilevel multi-objective formulation (BMO), proves that the two are equivalent under a separability assumption, and shows that selecting validation-aligned data can outperform naive mixing of SFT and validation data. The framework is extended to online self-refinement with importance sampling, which gives a principled way to weight generated responses by their alignment with the validation data. Experiments on quality enhancement and safety-aware fine-tuning show improvements over baselines and offer insights into data weighting, online exploration, and dynamic strategies for handling unsafe data. The approach provides data-efficient, validation-guided fine-tuning, with potential impact on safer and more reliable LLM deployment.
Abstract
Offline data selection and online self-refining generation, both of which enhance data quality, are crucial steps in adapting large language models (LLMs) to specific downstream tasks. We study both from an optimization perspective. Specifically, we use bilevel data selection for offline data selection with respect to a validation dataset, and we treat online self-refining generation as a model adaptation step that selects the model trained on current responses that best fits the validation data. Our framework offers a unified understanding of offline data selection and self-refining generation by assigning a learned data weight to each question and response, either explicitly or implicitly. For the first time, we theoretically demonstrate the effectiveness of the bilevel data selection framework and show its performance gains over unfiltered direct-mixing baselines. By combining offline data with validation-weighted online generations, our method enhances fine-tuning performance. Experiments on quality enhancement and safety-aware LLM fine-tuning validate its effectiveness.
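To make the idea of learning one data weight per training example more concrete, here is a minimal toy sketch of generic bilevel data selection. It is not the authors' algorithm or code. It assumes a scalar ridge-regression model in place of an LLM, sigmoid-parameterized weights, a closed-form lower-level solution, and an exact hypergradient. All names (`fit_weighted`, `bilevel_select`, `make_toy_data`) and all hyperparameters are hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_weighted(xs, ys, ws, lam=1e-3):
    """Lower level: weighted ridge fit of the scalar model y ~ theta * x."""
    a = sum(w * x * x for x, w in zip(xs, ws)) + lam
    b = sum(w * x * y for x, y, w in zip(xs, ys, ws))
    return b / a, a


def val_loss(theta, xv, yv):
    """Upper level: mean squared error on the validation set."""
    return sum((theta * x - y) ** 2 for x, y in zip(xv, yv)) / len(xv)


def bilevel_select(xs, ys, xv, yv, steps=200, lr=5.0, lam=1e-3):
    """Learn one weight per training example by descending the validation loss."""
    s = [0.0] * len(xs)  # logits; weight_i = sigmoid(s_i), starting at 0.5
    for _ in range(steps):
        ws = [sigmoid(z) for z in s]
        theta, a = fit_weighted(xs, ys, ws, lam)
        # Gradient of the validation loss with respect to theta.
        g = 2.0 * sum(x * (theta * x - y) for x, y in zip(xv, yv)) / len(xv)
        for i, (x, y) in enumerate(zip(xs, ys)):
            r = theta * x - y            # training residual of example i
            dL_dw = -g * x * r / a       # chain rule through theta*(w)
            s[i] -= lr * dL_dw * ws[i] * (1.0 - ws[i])
    ws = [sigmoid(z) for z in s]
    theta, _ = fit_weighted(xs, ys, ws, lam)
    return ws, theta


def make_toy_data(seed=0, n_good=40, n_bad=20, n_val=30, theta_true=2.0):
    """Training set mixes validation-aligned and misaligned (sign-flipped) labels."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n_good + n_bad)]
    ys = [theta_true * x + rng.gauss(0, 0.01) for x in xs[:n_good]]
    ys += [-theta_true * x + rng.gauss(0, 0.01) for x in xs[n_good:]]
    xv = [rng.gauss(0, 1) for _ in range(n_val)]
    yv = [theta_true * x + rng.gauss(0, 0.01) for x in xv]
    return xs, ys, xv, yv


if __name__ == "__main__":
    xs, ys, xv, yv = make_toy_data()
    theta_uniform, _ = fit_weighted(xs, ys, [0.5] * len(xs))
    ws, theta_selected = bilevel_select(xs, ys, xv, yv)
    print("validation loss, uniform weights:", val_loss(theta_uniform, xv, yv))
    print("validation loss, learned weights:", val_loss(theta_selected, xv, yv))
    print("mean weight, aligned examples:   ", sum(ws[:40]) / 40)
    print("mean weight, misaligned examples:", sum(ws[40:]) / 20)
```

In this toy setting the uniform-weight fit plays the role of an unfiltered mixing baseline, while the learned weights shift mass toward examples whose labels agree with the validation set. For an actual LLM the lower level has no closed form, so a practical method would need an approximate hypergradient or a reformulation of the bilevel problem.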
