DavIR: Data Selection via Implicit Reward for Large Language Models

Haotian Zhou, Tingkai Liu, Qianli Ma, Yufeng Zhang, Jianbo Yuan, Pengfei Liu, Yang You, Hongxia Yang

TL;DR

DavIR addresses the challenge of data selection for post-training LLMs by defining a per-datum learnability score and normalizing the loss gap to mitigate length bias, linking this score to an implicit reward framework. It generalizes Reducible Holdout Loss to core-set selection and introduces a normalized DavIR-DPO objective, enabling effective data pruning and alignment improvements without relying on teacher models. Empirically, a small subset of Alpaca data selected by DavIR can outperform the full 52K dataset across model families like LLaMA and Gemma, and DavIR enables beneficial data mixtures (e.g., Alpaca-4 with GSM8K) to balance open-domain QA and mathematical reasoning; the DavIR-DPO variant also yields an 8% relative improvement on AlpacaEval for Zephyr-7B-SFT. Overall, DavIR demonstrates robust, model-dependent data selection advantages across domains and scales, with clear pathways to integration into data flywheels and broader applicability to reasoning tasks, while acknowledging limitations around data quality/diversity and domain specificity.

Abstract

We introduce DavIR, a model-based data selection method for post-training Large Language Models. DavIR generalizes Reducible Holdout Loss to the core-set selection problem in causal language modeling, and quantifies the learnability of a given datum with respect to a pre-trained LLM based on the relative reduction in loss during fine-tuning, a metric we show to be closely related to the implicit reward model described in Direct Preference Optimization (DPO). We show that 6% of the Alpaca dataset selected with DavIR can steer both the LLaMA and Gemma model families to produce superior performance compared to the same models trained on the full 52K dataset. We also show that the Alpaca dataset compressed with DavIR can be combined with the GSM8K dataset to effectively balance open-domain freeform QA and mathematical reasoning capabilities. Finally, we apply the DavIR objective to DPO and develop a normalized DavIR-DPO objective that improves the alignment performance of the Zephyr-7B-SFT model by 8% (relative) on AlpacaEval, compared with training on the vanilla DPO objective.
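
The selection idea described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the learnability score is the loss gap between the base model and a fine-tuned reference model, divided by the base loss, and that per-datum loss is the mean negative log-likelihood over response tokens. The function names `sequence_loss`, `learnability`, and `select_top_k` are hypothetical.

```python
from typing import List, Sequence


def sequence_loss(token_logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood over the response tokens (assumed loss)."""
    return -sum(token_logprobs) / len(token_logprobs)


def learnability(base_loss: float, ref_loss: float) -> float:
    """Assumed score: relative loss reduction from base to fine-tuned model."""
    return (base_loss - ref_loss) / base_loss


def select_top_k(
    base_losses: Sequence[float], ref_losses: Sequence[float], k: int
) -> List[int]:
    """Return indices of the k data points with the highest assumed score."""
    scores = [learnability(b, r) for b, r in zip(base_losses, ref_losses)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy losses (made up for illustration, not taken from the paper).
base_losses = [2.0, 1.0, 4.0]
ref_losses = [1.0, 0.9, 3.0]
selected = select_top_k(base_losses, ref_losses, k=2)
```

In this toy example the raw loss gap is identical for the first and third data points (1.0 each), so the ordering between them comes entirely from the normalization: the first has the larger relative reduction and is ranked higher.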

Paper Structure

This paper contains 33 sections, 1 theorem, 11 equations, 7 figures, 13 tables, 1 algorithm.

Key Result

Proposition 1

Choosing either $\mathcal{L}_{ref}(x, y)$ or $\mathcal{L}_{base}(x, y)$ as the denominator for normalization does not affect the ranking of the learnability score. Specifically, if one datum scores higher than another under one choice of denominator, it also scores higher under the other.
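
A short derivation shows why such a ranking invariance would hold. It is a sketch, assuming the learnability score is the loss gap $\mathcal{L}_{base} - \mathcal{L}_{ref}$ divided by one of the two losses, and that both losses are strictly positive; the symbols $s_{base}$, $s_{ref}$, and $\rho$ are introduced here for illustration and do not come from the paper.

```latex
% Assumed forms of the two normalized scores
s_{base}(x, y) = \frac{\mathcal{L}_{base}(x, y) - \mathcal{L}_{ref}(x, y)}{\mathcal{L}_{base}(x, y)}
               = 1 - \frac{1}{\rho(x, y)},
\qquad
s_{ref}(x, y) = \frac{\mathcal{L}_{base}(x, y) - \mathcal{L}_{ref}(x, y)}{\mathcal{L}_{ref}(x, y)}
              = \rho(x, y) - 1,
\qquad
\rho(x, y) = \frac{\mathcal{L}_{base}(x, y)}{\mathcal{L}_{ref}(x, y)} > 0.

% Both scores are strictly increasing functions of \rho on (0, \infty), hence
s_{base}(x_1, y_1) > s_{base}(x_2, y_2)
\iff \rho(x_1, y_1) > \rho(x_2, y_2)
\iff s_{ref}(x_1, y_1) > s_{ref}(x_2, y_2).
```

Under these assumptions, both normalizations are monotone transformations of the same loss ratio, so they order any pair of data points identically.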

Figures (7)

  • Figure 1: DavIR outperforms full data fine-tuning and data selection based on a teacher LLM across model scales. Performance comparison of 7B and 13B parameter models fine-tuned with data selected using DavIR (3,000 items), the full Alpaca dataset (52K), and data filtered using ChatGPT (9,229 items). "G" represents evaluation using GPT-4, and "H" represents human evaluation. The statistical significance of DavIR's performance gain over training on the full dataset and over other core-set selection methods is established in subsequent sections.
  • Figure 2: Models fine-tuned with data selected by DavIR surpass the full dataset on Alpaca3.5. This figure shows the win score comparison between models trained with different sizes of datasets and the full dataset, as well as the improvement brought by using the normalization method. We select the model fine-tuned on the full dataset as the baseline. Win Score is computed as $1+(N_{win}-N_{lose})/N_{total}$, with $1$ being equal performance.
  • Figure 3: DavIR significantly outperforms random sampling. Using Text-Davinci-003 as the frozen baseline model, we show that the performance of random selection from the Alpaca-4 dataset scales logarithmically with the number of training examples, significantly underperforming DavIR. Note that the x-axis is log-scale. Win Rate is computed as $N_{win}/N_{total}$, where $N_{win}$ and $N_{total}$ are the number of wins and the total number of test examples.
  • Figure 4: Data mixing with LLaMA-7B and DavIR. The x-axis represents the number of selected Alpaca-4 data points, plotted on a logarithmic scale.
  • Figure 5: GSM8K training data was incompressible with LLaMA-7B.
  • ...and 2 more figures
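
The two evaluation metrics defined in the captions of Figures 2 and 3 can be computed as in the minimal sketch below. The formulas are taken from the captions; the function names and the example counts are hypothetical, and treating the remaining comparisons (`n_total - n_win - n_lose`) as ties is an assumption.

```python
def win_score(n_win: int, n_lose: int, n_total: int) -> float:
    """Win Score from Figure 2: 1 + (N_win - N_lose) / N_total.

    A value of 1 means equal performance with the baseline model.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 1 + (n_win - n_lose) / n_total


def win_rate(n_win: int, n_total: int) -> float:
    """Win Rate from Figure 3: N_win / N_total."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_win / n_total


# Made-up counts for illustration: 50 wins, 25 losses, 25 ties out of 100.
score = win_score(n_win=50, n_lose=25, n_total=100)
rate = win_rate(n_win=50, n_total=100)
```

Win Score penalizes losses as well as rewarding wins, so a model that wins and loses equally often scores exactly 1, whereas Win Rate ignores how the non-winning comparisons split between losses and ties.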

Theorems & Definitions (2)

  • Proposition 1
  • proof