A Unified Understanding of Offline Data Selection and Online Self-refining Generation for Post-training LLMs

Quan Xiao, Tianyi Chen

TL;DR

This work studies how to improve LLM post-training by unifying offline data selection and online self-refining generation under a bilevel optimization framework. It introduces $\texttt{BDS}$ (bilevel data selection) and $\texttt{BMO}$ (bilevel multi-objective learning) and proves their equivalence under a separability assumption, showing that selecting validation-aligned data can outperform naive mixing of SFT and validation data. The framework is extended to online self-refinement with importance sampling, which gives a principled way to weight generated responses by their alignment with the validation data. Experiments on quality enhancement and safety-aware tuning show better performance than the baselines, along with insights into data weighting, online exploration, and dynamic strategies for handling unsafe data. The approach offers data-efficient, validation-guided fine-tuning, with potential impact on safer and more reliable LLM deployment.

Abstract

Offline data selection and online self-refining generation, both of which improve data quality, are crucial steps in adapting large language models (LLMs) to specific downstream tasks. We tackle both through an optimization perspective. Specifically, we use bilevel data selection for offline data selection with respect to the validation dataset, and we treat online self-refining generation as a model adaptation step that selects the model trained on current responses that best fits the validation data. Our framework offers a unified understanding of offline data selection and self-refining generation by assigning a learned data weight to each question and response, either explicitly or implicitly. For the first time, we theoretically demonstrate the effectiveness of the bilevel data selection framework and show its performance gains over unfiltered direct mixing baselines. By combining offline data with validation-weighted online generations, our method enhances fine-tuning performance. Experiments on quality enhancement and safety-aware LLM fine-tuning validate its effectiveness.
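
As a reading aid, the data-weighting idea described in the abstract can be written as a generic bilevel program. This is a sketch in assumed notation and not necessarily the paper's exact formulation: $w$ is a vector of data weights on the probability simplex $\Delta^N$, $\mathcal{L}_{\text{val}}$ is a validation loss, and $\mathcal{L}_{\text{SFT}}(\theta; x^i, y^i)$ is the fine-tuning loss on the $i$-th question-response pair.

$$
\min_{w \in \Delta^N} \; \mathcal{L}_{\text{val}}\big(\theta^*(w)\big)
\quad \text{s.t.} \quad
\theta^*(w) \in \arg\min_{\theta} \sum_{i=1}^{N} w_i \, \mathcal{L}_{\text{SFT}}(\theta; x^i, y^i)
$$

In this form the lower level trains the model on reweighted data, while the upper level scores the weights by the validation loss of the resulting model. A sample whose weight is driven toward zero is effectively filtered out.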

Paper Structure

This paper contains 28 sections, 9 theorems, 69 equations, 6 figures, 10 tables, 1 algorithm.

Key Result

Theorem 1

Suppose the separability assumption holds. For the $\texttt{BMO}$ problem with $\mathcal{L}_i(\theta)=\mathcal{L}_{\text{SFT}} (\theta;x^i,y^i)$ for $i\in [N]$, where $(x^i,y^i)\in \mathcal{D}_{\text{SFT}}^{-}=\{(x^i,y^i)\}_{i=1}^{N}$ and $M=N$, any global (or local) solution $\theta^*$ of $\texttt{BM}$…

Figures (6)

  • Figure 1: An overview of the bilevel data selection principle. 'WP' is short for weak Pareto optimal set, 'w.o' is short for without, and samples $1$ to $3$ are drawn from the lower-level SFT dataset. The orange $2$D plates depict the sets of per-sample SFT-loss minimizers, while the blue surface denotes the $3$D validation-loss landscape. The validation loss attains its minimum both on the individual minimizer of the $3^{\text{rd}}$ sample and at the shared minimum of the $1^{\text{st}}$ and $3^{\text{rd}}$ samples. Optimizing the validation loss at the shared minimum of all lower-level sample losses degrades performance, but the optimum is attained once the $2^{\text{nd}}$ sample is removed.
  • Figure 2: An overview of our online self-refining algorithm design. 'Q' and 'A' are short for Question and Answer. We mask part of the offline responses to each question and generate on-policy responses in their place. We assign both a question-level validation score via bilevel data selection (BDS) and a response-level data weight via bilevel multi-objective learning (BMO).
  • Figure 3: An overview of the key steps for establishing the theorems and the relations among them.
  • Figure 4: Ablation study of our online algorithm and comparison with other baselines. Fine-tuning loss on the Alpaca-cleaned dataset (lower level) when fine-tuning the Llama-3-8b-Instruct model on the validation tuning task. $R=N_M/N$ denotes the online sample ratio, $G$ is the number of responses generated per question, and $\rho$ is the mixing ratio of the upper-level and lower-level datasets for the direct mixing approach.
  • Figure 5: Average response length of the top $10\%$ of questions ranked by learned data weights, for the offline selection and online self-refining approaches. The self-refining approach tends to learn from simple to hard questions.
  • ...and 1 more figure
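
The selection principle in Figure 1 can be illustrated with a toy numerical sketch. Everything here is an assumption made for illustration and is not the paper's algorithm: the model is a single scalar `theta`, each sample loss is the quadratic `(theta - a_i)**2`, the weights are parameterized as a softmax over hypothetical logits `lam`, and the lower-level problem is solved in closed form. The function name `select_weights` is likewise hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def select_weights(a, v, steps=500, lr=0.5):
    """Toy bilevel data selection with a scalar parameter theta (illustrative only).

    Lower level: theta*(w) = argmin_theta sum_i w_i * (theta - a_i)^2 = sum_i w_i * a_i
    Upper level: minimize (theta*(w) - v)^2 over w = softmax(lam)
    """
    a = np.asarray(a, dtype=float)
    lam = np.zeros_like(a)  # start from uniform weights
    for _ in range(steps):
        w = softmax(lam)
        theta = float(w @ a)  # closed-form lower-level solution
        # Chain rule: d theta / d lam_j = w_j * (a_j - theta)
        grad = 2.0 * (theta - v) * w * (a - theta)
        lam -= lr * grad  # gradient step on the upper-level objective
    w = softmax(lam)
    return w, float(w @ a)

# Samples 1 and 3 have minimizers near the validation target; sample 2 does not.
a = [1.0, 5.0, 1.2]
v = 1.1
w, theta = select_weights(a, v)
print("learned weights:", np.round(w, 3))
print("theta with learned weights:", round(theta, 3))
print("theta with uniform weights:", round(float(np.mean(a)), 3))
```

In this toy setting, training on all three samples with equal weight pulls `theta` toward the misaligned second sample, whereas the learned weights shrink that sample's contribution and move `theta` toward the validation target. This mirrors, in one dimension, the caption's point that removing the second sample restores the validation optimum.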

Theorems & Definitions (21)

  • Definition 1
  • Theorem 1: Equivalence of $\texttt{BMO}$ and $\texttt{BDS}$
  • Definition 2: Individual minimizer
  • Definition 3: Useful samples
  • Remark 1
  • Theorem 2: $\texttt{BDS}$ can select useful data
  • Theorem 3
  • Theorem 4
  • Lemma 5
  • Lemma 6: Implicit response weight given by $\texttt{BMO}$
  • ...and 11 more