Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

Jie Liu; Zhanhui Zhou; Jiaheng Liu; Xingyuan Bu; Chao Yang; Han-Sen Zhong; Wanli Ouyang

TL;DR

This paper analyzes iterative direct preference optimization (iDPO) and identifies a pitfall where improved responses become more verbose. It introduces Iterative Length-Regularized DPO (iLR-DPO), a multi-objective framework that adds a length penalty to DPO to control verbosity while aligning with human preferences via an online reward model. Through a case study on a 7B open-source model, the method achieves GPT-4-level alignment on AlpacaEval 2.0 and demonstrates strong performance on MT-Bench, Arena-Hard, and the Open LLM Leaderboard, while keeping response length in check. The work also provides ablations, implementation details, and releases open-source resources to enable further research and evaluation.

Abstract

Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a $50.5\%$ length-controlled win rate against $\texttt{GPT-4 Preview}$ on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard, and the Open LLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback.
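To make the idea of penalizing response length inside DPO concrete, here is a minimal per-pair sketch. It is not the paper's implementation: the exact LR-DPO objective is defined in the paper's Length-Regularized DPO (LR-DPO) section, which is not reproduced on this page. The sketch assumes the penalty enters as a margin proportional to the token-length difference between the preferred and dispreferred responses, added inside the sigmoid. The function name `lr_dpo_loss`, the argument names, and the default values of `beta` and `alpha` are all hypothetical.

```python
import math


def lr_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                len_w, len_l, beta=0.1, alpha=0.01):
    """Per-pair DPO loss with a length margin (illustrative sketch only).

    logp_* are summed log-probabilities of the preferred (w) and
    dispreferred (l) responses under the policy; ref_logp_* are the same
    quantities under the frozen reference model; len_* are token counts.
    """
    # Standard DPO margin: difference of implicit rewards beta * log(pi / pi_ref).
    dpo_margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

    # ASSUMPTION: the length regularizer is a margin alpha * (|y_w| - |y_l|).
    # When the preferred response is longer, the logit grows, so the loss
    # (and its gradient) shrinks and the pair pulls less toward verbosity.
    length_margin = alpha * (len_w - len_l)

    z = dpo_margin + length_margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

Setting `alpha=0` recovers the ordinary DPO loss for the pair. Under the assumed form, increasing `alpha` lowers the loss on pairs whose preferred response is longer, which is one way a length penalty can keep preference gains from being driven by verbosity alone.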

Paper Structure (31 sections, 3 equations, 1 figure, 3 tables)

Introduction
Iterative Length-Regularized DPO (iLR-DPO)
Synthetic Preference Collection
Length-Regularized DPO (LR-DPO)
End-to-End Iterative Training Pipeline
Experiments
Experimental Setup
Base Model.
Prompt & Reward Model.
Evaluation Metrics.
Implementation Details
Training.
Generation.
Experimental Results
AlpacaEval 2.0 Leaderboard.
...and 16 more sections

Figures (1)

Figure 1: Length-controlled win rates and response lengths on AlpacaEval 2.0. iLR-DPO enhances performance without significantly increasing response length. The trained model achieves a $50.5\%$ length-controlled win rate against GPT-4 Preview, making it the first open-source model to match GPT-4 Preview.
