Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences

Pulkit Pattnaik; Rishabh Maheshwary; Kelechi Ogueji; Vikas Yadav; Sathwik Tejaswi Madhusudhan

TL;DR

Curry-DPO builds multiple preference pairs per prompt from ranked responses and uses them in DPO training in easy-to-hard order, following a curriculum learning methodology. It consistently shows performance gains over standard single-pair DPO on MT-Bench, Vicuna, WizardLM, and the UltraFeedback test set.

Abstract

Direct Preference Optimization (DPO) is an effective technique that leverages pairwise preference data (usually one chosen and one rejected response per user prompt) to align LLMs to human preferences. In practice, multiple responses of varying relative quality can exist for a given prompt. When quality ratings are available for these responses, we propose using them to create multiple preference pairs per prompt. Our work focuses on systematically using the constructed preference pairs in DPO training via a curriculum learning methodology. In particular, we order these preference pairs from easy to hard (emulating curriculum training) according to various criteria. We show detailed comparisons of our proposed approach with the standard single-pair DPO setting. Our method, which we call Curry-DPO, consistently shows performance gains on MT-Bench, Vicuna, WizardLM, and the UltraFeedback test set, highlighting its effectiveness. More specifically, Curry-DPO achieves a score of 7.43 on MT-Bench with the Zephyr-7B model, outperforming the majority of existing LLMs of similar parameter size. Curry-DPO also achieves the highest adjusted win rates on the Vicuna, WizardLM, and UltraFeedback test datasets (90.7%, 87.1%, and 87.9%, respectively) in our experiments, with notable gains of up to 7.5% over the standard DPO technique. We release the preference pairs used in alignment at: https://huggingface.co/datasets/ServiceNow-AI/Curriculum_DPO_preferences
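The pair construction and easy-to-hard ordering described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `PreferencePair` class, the `build_curriculum` function, and the example scores are hypothetical. It assumes that the top-rated response is always the chosen one and that "easy" means a large score gap, which is only one of the "various criteria" the abstract mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical container for one (chosen, rejected) pair."""
    prompt: str
    chosen: str
    rejected: str
    score_gap: float


def build_curriculum(prompt, scored_responses):
    """Return preference pairs ordered from easy (large gap) to hard (small gap).

    Assumption: the best-rated response is paired against every other response.
    """
    # Rank responses from highest to lowest quality score.
    ranked = sorted(scored_responses, key=lambda item: item[1], reverse=True)
    best_text, best_score = ranked[0]
    pairs = [
        PreferencePair(prompt, best_text, text, best_score - score)
        for text, score in ranked[1:]
    ]
    # A larger score gap is treated as an easier pair, so it comes first.
    return sorted(pairs, key=lambda pair: pair.score_gap, reverse=True)


# Made-up scores for four responses to one prompt (illustration only).
responses = [("R2", 8.0), ("R4", 3.0), ("R1", 9.5), ("R3", 6.0)]
curriculum = build_curriculum("example prompt", responses)
for step, pair in enumerate(curriculum, start=1):
    print(step, pair.chosen, pair.rejected, pair.score_gap)
```

With these made-up scores, the first training iteration would see the pair with the widest quality gap, and later iterations would see progressively closer pairs.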

Paper Structure (36 sections, 2 equations, 6 figures, 7 tables)

Introduction
Related Work
Aligning LLMs to Human Preferences
Curriculum Learning
Approach
Sampling Multiple Responses per Prompt
Curating and Arranging Multiple Preference Pairs
Training methodology
Experimental Setup
Datasets
Models
Evaluation
MT-Bench
Vicuna bench
WizardLM
...and 21 more sections

Figures (6)

Figure 1: The top part of the figure shows the steps to create multiple preference pairs for Curry-DPO. The 4 responses for the given prompt are ranked by their scores. The computed pairwise score differences are then used to rank the preference pairs. The lower right block represents multiple iterations of Curry-DPO. Iteration 1 uses the easiest preference pair $(Y_w=R_1, Y_L = R_4)$, Iteration 2 uses the second-easiest preference pair $(Y_w=R_1, Y_L = R_3)$, and so on. The SFT model acts as the reference model for Iteration 1, the Iteration 1 model acts as the reference model for Iteration 2, and so on.
Figure 2: MT-Bench result comparison.
Figure 3: GPT-4 evaluation prompt for single grading of MT-Bench questions.
Figure 4: GPT-4 evaluation prompt for Vicuna and WizardLM pairwise grading.
Figure 5: GPT-4 evaluation prompt for chain-of-thought math and reasoning questions.
...and 1 more figure
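The iteration scheme in the Figure 1 caption, where each iteration's reference model is the model produced by the previous iteration, can be sketched alongside the standard per-pair DPO loss. This is a rough sketch under stated assumptions: the function names are hypothetical, the log-probabilities are plain floats standing in for summed token log-probabilities, and `beta=0.1` is an assumed default rather than a value taken from the paper.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Computes -log(sigmoid(beta * margin)), where margin is the difference
    between the policy-vs-reference log-ratios of the chosen (w) and
    rejected (l) responses. beta=0.1 is an assumed value.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def reference_schedule(num_iterations):
    """List which model serves as the reference in each curriculum iteration."""
    schedule = []
    for k in range(1, num_iterations + 1):
        # Iteration 1 starts from the SFT model; later iterations chain on
        # the model produced by the preceding iteration.
        reference = "SFT model" if k == 1 else f"iteration {k - 1} model"
        schedule.append((k, reference))
    return schedule


for iteration, reference in reference_schedule(3):
    print(f"Iteration {iteration}: reference = {reference}")
```

When the policy equals its reference, the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one.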
