D2PO: Discriminator-Guided DPO with Response Evaluation Models

Prasann Singhal, Nathan Lambert, Scott Niekum, Tanya Goyal, Greg Durrett

TL;DR

D2PO is a discriminator-guided Direct Preference Optimization (DPO) framework for online language model alignment. A discriminative response evaluator is trained on gold preferences and then used to silver-label additional policy outputs. This two-phase cycle (discriminator training on a limited set of human labels, followed by policy updates on silver-labeled data) improves data efficiency and performance relative to standard DPO and PPO baselines, especially under distribution shift. Across synthetic tasks and UltraFeedback, D2PO converges faster and to higher reward with the same gold preference budget, and analyses show that a discriminator updated online retains a reliable reward signal where a static one deteriorates. The work also finds that keeping the discriminator separate from the policy helps when online labeling is constrained, which offers practical guidance for deploying discriminator-based evaluation in real-world alignment. Overall, D2PO makes online preference learning more data-efficient by using discriminative evaluation to strengthen policy training without a proportional increase in human labeling cost.

Abstract

Varied approaches for aligning language models have been proposed, including supervised fine-tuning, RLHF, and direct optimization methods such as DPO. Although DPO has rapidly gained popularity due to its straightforward training process and competitive results, there is an open question of whether there remain practical advantages of using a discriminator, like a reward model, to evaluate responses. We propose D2PO, discriminator-guided DPO, an approach for the online setting where preferences are being collected throughout learning. As we collect gold preferences, we use these not only to train our policy, but also to train a discriminative response evaluation model that silver-labels even more synthetic data for policy training. We explore this approach across a diverse set of tasks, including a realistic chat setting, and find that it leads to higher-quality outputs than DPO with the same data budget and to greater efficiency in terms of preference data requirements. Furthermore, we show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and it benefits from maintaining a discriminator separate from the policy model.
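
The data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `collect_round`, the `(prompt, chosen, rejected)` triple layout, the rule that the first `gold_budget` prompts receive gold labels, and all `toy_*` stand-ins are assumptions made for illustration only. The sketch shows one round of labeling, after which gold pairs would update the discriminator and gold plus silver pairs would update the policy.

```python
from typing import Callable, List, Sequence, Tuple

# (prompt, chosen, rejected); this triple layout is an assumption of the sketch.
Preference = Tuple[str, str, str]
# A labeler returns True when the first response is preferred over the second.
Labeler = Callable[[str, str, str], bool]


def collect_round(
    prompts: Sequence[str],
    sample_pair: Callable[[str], Tuple[str, str]],
    gold_label: Labeler,
    silver_label: Labeler,
    gold_budget: int,
) -> Tuple[List[Preference], List[Preference]]:
    """Label one round of policy samples.

    Assumption: the first `gold_budget` prompts go to the gold annotator and
    the rest to the discriminator. The source does not specify this split.
    """
    gold: List[Preference] = []
    silver: List[Preference] = []
    for i, prompt in enumerate(prompts):
        a, b = sample_pair(prompt)
        use_gold = i < gold_budget
        first_wins = (gold_label if use_gold else silver_label)(prompt, a, b)
        chosen, rejected = (a, b) if first_wins else (b, a)
        (gold if use_gold else silver).append((prompt, chosen, rejected))
    return gold, silver


# Toy stand-ins, all hypothetical: the "policy" emits two fixed continuations,
# the gold annotator prefers the longer response, and the "discriminator"
# prefers the response with more distinct words.
def toy_sample_pair(prompt: str) -> Tuple[str, str]:
    return prompt + " yes", prompt + " yes indeed"


def toy_gold(prompt: str, a: str, b: str) -> bool:
    return len(a) > len(b)


def toy_silver(prompt: str, a: str, b: str) -> bool:
    return len(set(a.split())) > len(set(b.split()))


gold, silver = collect_round(
    ["p1", "p2", "p3", "p4", "p5"],
    toy_sample_pair,
    toy_gold,
    toy_silver,
    gold_budget=2,
)
discriminator_data = gold    # only gold labels update the discriminator
policy_data = gold + silver  # the policy trains on gold and silver labels
```

In a real setup the toy labelers would be replaced by a human (or gold reward) annotator and a trained response evaluation model, and the two data sets would feed a discriminator update and a DPO update respectively.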

Paper Structure (26 sections, 12 figures, 10 tables, 1 algorithm)

Figures (12)

  • Figure 1: Comparison of standard DPO, online preference optimization methods (with reward model-labeled data), and our proposed D2PO method. The key addition in (c) is the online learning of the reward model on new preferences during policy optimization.
  • Figure 2: D2PO trains an initial policy model and response evaluation model from gold preferences. It then samples prompts, samples outputs of those prompts, and uses a mix of human labeling and silver labeling to produce policy training data. Only human-labeled data is used to update the response evaluation model.
  • Figure 3: Amount of gold preference data (x-axis; corresponds to progress through training, not counting the initial 1.6k offline preferences) vs. resulting gold reward, averaged over 3 runs. We compare D2PO against OPO with gold data only, as well as "basic" DPO and OPO with an RM trained on the initial data (note that this is a smaller set than in Table \ref{tab:traininitanalysis}). Our method reaches higher reward on Word Collector and Contrastive Distillation, and maxes out faster on Unique Nouns.
  • Figure 4: (Left) Gold reward over training on UltraFeedback; (Right) Eurus RM D2PO vs. OPO with a budget of 500 preferences. The dashed line represents the UltraFeedback reward at the highest-reward point for OPO with the initial model. D2PO outperforms OPO in this setting.
  • Figure 5: Reward model accuracy (y-axis) vs. training progress (x-axis) for our datasets using OPO (static, 50k RM). The discriminative capability of the reward model degrades substantially as training progresses, ending up near random chance.
  • ...and 7 more figures
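
The reward model accuracy plotted in Figure 5 can be read as a pairwise metric. The sketch below assumes that accuracy means the fraction of gold-labeled pairs on which the reward model scores the preferred response strictly higher; the function name `pairwise_accuracy`, the triple layout, and the toy scorer are hypothetical and not taken from the paper. Under this definition, a scorer that guesses at random sits near 0.5, which is the "random chance" level the caption refers to.

```python
from typing import Callable, Sequence, Tuple

# (prompt, chosen, rejected), where "chosen" is the gold-preferred response.
Preference = Tuple[str, str, str]


def pairwise_accuracy(
    score: Callable[[str, str], float],
    pairs: Sequence[Preference],
) -> float:
    """Fraction of pairs where `score` ranks the chosen response strictly higher.

    Ties count as errors in this sketch.
    """
    if not pairs:
        raise ValueError("need at least one labeled pair")
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return correct / len(pairs)


# Hypothetical scorer standing in for a reward model: longer is better.
def toy_score(prompt: str, response: str) -> float:
    return float(len(response))


held_out = [
    ("q", "a long answer", "short"),
    ("q", "ok", "a rambling reply"),
    ("q", "fine then", "no"),
]
accuracy = pairwise_accuracy(toy_score, held_out)  # 2 of 3 pairs ranked correctly
```

Tracking this number on freshly sampled policy outputs over training is one way to observe the degradation the caption describes for a static reward model.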