Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences

Pulkit Pattnaik, Rishabh Maheshwary, Kelechi Ogueji, Vikas Yadav, Sathwik Tejaswi Madhusudhan

TL;DR

Curry-DPO constructs multiple preference pairs per prompt and uses them systematically in DPO training, ordered by a curriculum learning methodology. It shows consistent performance gains on MT-bench, Vicuna, WizardLM, and the UltraFeedback test set.

Abstract

Direct Preference Optimization (DPO) is an effective technique that leverages pairwise preference data (usually one chosen and one rejected response per user prompt) to align LLMs to human preferences. In practice, multiple responses of varying relative quality can exist for a given prompt. When quality ratings are available for these responses, we propose using them to create multiple preference pairs for a given prompt. Our work focuses on systematically using the constructed preference pairs in DPO training via a curriculum learning methodology. In particular, we order these pairs of preference data from easy to hard (emulating curriculum training) according to various criteria. We show detailed comparisons of our proposed approach to the standard single-pair DPO setting. Our method, which we call Curry-DPO, consistently shows increased performance gains on MT-bench, Vicuna, WizardLM, and the UltraFeedback test set, highlighting its effectiveness. More specifically, Curry-DPO achieves a score of 7.43 on MT-bench with the Zephyr-7B model, outperforming the majority of existing LLMs of similar parameter size. Curry-DPO also achieves the highest adjusted win rates on the Vicuna, WizardLM, and UltraFeedback test datasets (90.7%, 87.1%, and 87.9% respectively) in our experiments, with notable gains of up to 7.5% over the standard DPO technique. We release the preference pairs used in alignment at: https://huggingface.co/datasets/ServiceNow-AI/Curriculum_DPO_preferences
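
The pair construction and easy-to-hard ordering described in the abstract can be illustrated with a minimal sketch. It assumes one particular difficulty criterion, the score gap between the chosen and rejected response (a larger gap is treated as easier), and it always uses the top-rated response as the chosen one. The names `PreferencePair` and `build_curriculum_pairs` and the example scores are hypothetical, not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    score_gap: float  # chosen score minus rejected score


def build_curriculum_pairs(
    prompt: str, rated: List[Tuple[str, float]]
) -> List[PreferencePair]:
    """Build preference pairs from (response, score) tuples, easiest first.

    Assumption: the best-rated response is always the chosen one, and a
    larger score gap means an easier pair.
    """
    if len(rated) < 2:
        return []  # a preference pair needs at least two responses
    ranked = sorted(rated, key=lambda r: r[1], reverse=True)
    best_text, best_score = ranked[0]
    pairs = [
        PreferencePair(prompt, best_text, text, best_score - score)
        for text, score in ranked[1:]
    ]
    # Easy-to-hard order: largest score gap first.
    pairs.sort(key=lambda p: p.score_gap, reverse=True)
    return pairs


# Hypothetical ratings for four responses to one prompt.
example = build_curriculum_pairs(
    "Explain DPO.",
    [("R2", 7.0), ("R4", 2.0), ("R1", 9.0), ("R3", 5.0)],
)
for pair in example:
    print(pair.chosen, pair.rejected, pair.score_gap)
```

With these made-up scores, the best response `R1` is paired first with `R4`, then `R3`, then `R2`. Other criteria could be swapped in by changing the sort key.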

Paper Structure (36 sections, 2 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: The top part of the figure shows the steps to create multiple preference pairs for Curry-DPO. The 4 responses to the given prompt are ranked by their scores. The pairwise score differences are then computed and used to rank the preference pairs. The lower right block represents multiple iterations of Curry-DPO. Iteration 1 uses the easiest preference pair $(Y_w=R_1, Y_L = R_4)$, Iteration 2 uses the second-easiest preference pair $(Y_w=R_1, Y_L = R_3)$, and so on. The SFT model acts as the reference model for Iteration 1, the Iteration 1 model acts as the reference model for Iteration 2, and so on.
  • Figure 2: MT-bench result comparison.
  • Figure 3: GPT-4 evaluation prompt for single grading of MT-bench questions.
  • Figure 4: GPT-4 evaluation prompt for Vicuna and WizardLM pairwise grading.
  • Figure 5: GPT-4 evaluation prompt for chain-of-thought math and reasoning questions.
  • ...and 1 more figures
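
The iteration scheme in the Figure 1 caption, where each iteration's model becomes the reference model for the next, can be sketched as a simple loop. This is only an illustration under stated assumptions: `run_curriculum` and the `train_dpo` callable are hypothetical names, the actual DPO training step is left to the caller, and how the policy is initialized in each iteration is not specified here.

```python
from typing import Callable, List, Sequence, TypeVar

Model = TypeVar("Model")
Pair = TypeVar("Pair")


def run_curriculum(
    sft_model: Model,
    pairs_by_difficulty: Sequence[Sequence[Pair]],
    train_dpo: Callable[[Model, Sequence[Pair]], Model],
) -> List[Model]:
    """Run one DPO iteration per difficulty level, easiest level first.

    `train_dpo(reference, pairs)` is a hypothetical callable that trains a
    policy against `reference` on `pairs` and returns the trained model.
    """
    reference = sft_model  # Iteration 1 uses the SFT model as reference
    checkpoints: List[Model] = []
    for pairs in pairs_by_difficulty:
        policy = train_dpo(reference, pairs)
        checkpoints.append(policy)
        reference = policy  # the next iteration uses this model as reference
    return checkpoints


# Stand-in trainer that only records which reference each iteration saw.
seen_references: List[str] = []


def fake_train_dpo(reference: str, pairs: Sequence[str]) -> str:
    seen_references.append(reference)
    return f"{reference}+{pairs[0]}"


models = run_curriculum("sft", [["easy"], ["medium"], ["hard"]], fake_train_dpo)
print(seen_references)
print(models)
```

The stand-in trainer makes the chaining visible: the reference passed to each iteration is the output of the previous one, starting from the SFT model.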