Length Desensitization in Direct Preference Optimization

Wei Liu, Yang Bai, Chengcheng Han, Rongxiang Weng, Jun Xu, Xuezhi Cao, Jingang Wang, Xunliang Cai

TL;DR

This work reveals that Direct Preference Optimization (DPO) is inherently length-sensitive, causing models to overemphasize verbose responses during offline preference optimization. By theoretically proving that DPO’s gradient direction is biased by data length, the authors derive LD-DPO, which decouples length influence from substantive preferences through a tunable parameter $\alpha$ that modifies the sequence likelihood. Empirical evaluation across multiple models and benchmarks (MT-Bench, AlpacaEval 2, Arena-Hard) demonstrates that LD-DPO yields shorter, more concise outputs (10–40% reduction) and improves reasoning, with performance gains varying by model capability. The results establish LD-DPO as a practical and effective approach to align LLMs with human preferences while preserving or enhancing reasoning quality, and they introduce a quantitative measure of length sensitivity via the coefficient $\gamma$ to compare models.

Abstract

Direct Preference Optimization (DPO) is widely utilized in the Reinforcement Learning from Human Feedback (RLHF) phase to align Large Language Models (LLMs) with human preferences, thereby enhancing both their harmlessness and efficacy. However, it has been observed that DPO tends to over-optimize for verbosity, which can detrimentally affect both performance and user experience. In this paper, we conduct an in-depth theoretical analysis of DPO's optimization objective and reveal a strong correlation between its implicit reward and data length. This correlation misguides the optimization direction, resulting in length sensitivity during the DPO training and leading to verbosity. To address this issue, we propose a length-desensitization improvement method for DPO, termed LD-DPO. The proposed method aims to desensitize DPO to data length by decoupling explicit length preference, which is relatively insignificant, from the other implicit preferences, thereby enabling more effective learning of the intrinsic preferences. We utilized two settings (Base and Instruct) of Llama2-13B, Llama3-8B, and Qwen2-7B for experimental validation on various benchmarks including MT-Bench and AlpacaEval 2. The experimental results indicate that LD-DPO consistently outperforms DPO and other baseline methods, achieving more concise responses with a 10-40% reduction in length compared to DPO. We conducted in-depth experimental analyses to demonstrate that LD-DPO can indeed achieve length desensitization and align the model more closely with human-like preferences.
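
The decoupling described above can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not spell out: that the tunable parameter $\alpha$ scales the log-probabilities of tokens beyond the length shared by the two responses (`public_len`, a name chosen here), that the reference-model likelihood is left unchanged, and that the loss otherwise has the standard DPO form. The function names and the defaults `alpha=0.5` and `beta=0.1` are hypothetical and are not taken from the paper.

```python
import math


def ld_log_likelihood(token_logps, public_len, alpha):
    """Sequence log-likelihood with tokens past `public_len` scaled by alpha.

    Assumed form: alpha=1 gives the ordinary sequence log-likelihood,
    alpha=0 keeps only the first `public_len` tokens.
    """
    shared = sum(token_logps[:public_len])
    excess = sum(token_logps[public_len:])
    return shared + alpha * excess


def ld_dpo_loss(pol_w, pol_l, ref_w, ref_l, alpha=0.5, beta=0.1):
    """DPO-style loss on one preference pair of per-token log-probabilities.

    pol_w / pol_l: policy log-probs for the chosen / rejected response.
    ref_w / ref_l: reference-model log-probs for the same tokens
                   (left unmodified here, which is an assumption).
    """
    public_len = min(len(pol_w), len(pol_l))
    reward_w = ld_log_likelihood(pol_w, public_len, alpha) - sum(ref_w)
    reward_l = ld_log_likelihood(pol_l, public_len, alpha) - sum(ref_l)
    margin = beta * (reward_w - reward_l)
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical per-token log-probabilities for a longer chosen response.
chosen = [-1.0, -1.0, -1.0]
rejected = [-1.0, -1.0]
loss_dpo = ld_dpo_loss(chosen, rejected, chosen, rejected, alpha=1.0)
loss_pub = ld_dpo_loss(chosen, rejected, chosen, rejected, alpha=0.0)
```

Under these assumptions, setting `alpha=1.0` recovers the usual DPO loss, while smaller values reduce the contribution of the extra tokens in the longer response.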

Paper Structure (26 sections, 22 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Performance of the iterative DPO model [chenself] on Arena-Hard and AlpacaEval 2.
  • Figure 2: Comparison of the relationship between the length of preference data pairs and $\pi_\theta(y_l|x)/\pi_\theta(y_w|x)$ under both DPO and LD-DPO. Measured on Llama3-8B-Instruct with the UltraFeedback dataset [cui2023ultrafeedback]; the heatmap values represent $\log \pi_\theta(y_l|x)-\log \pi_\theta(y_w|x)$.
  • Figure 3: Relationship between the predicted probability difference $\log \pi_\theta(y_w|x)-\log \pi_\theta(y_l|x)$ and the data length difference under different settings: (a) Llama2-13B-Chat; (b) Llama3-8B-Instruct. In each subplot, the left image shows data where the chosen response is longer, and the right image shows data where the rejected response is longer. DPO-Pub denotes LD-DPO with $\alpha=0$. The images depict the true distribution on the UltraFeedback dataset during training.
  • Figure 4: Hyperparameter analysis of $\alpha$ with Llama3-8B-Instruct on AlpacaEval 2 (left) and MT-Bench (right).
  • Figure 5: (a) The optimization objective of DPO; (b) the partial derivative of $\mathcal{L}_{DPO}(\mathcal{X}_1;\mathcal{X}_2)$ with respect to $\mathcal{X}_1$; (c) the partial derivative of $\mathcal{L}_{DPO}(\mathcal{X}_1;\mathcal{X}_2)$ with respect to $\mathcal{X}_2$, where $\mathcal{X}_1$ denotes $\pi_{\theta}(y_w|x)$ and $\mathcal{X}_2$ denotes $\pi_{\theta}(y_l|x)$.
  • ...and 4 more figures
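
The partial derivatives named in the Figure 5 caption can be worked out from the standard DPO objective. The following is a sketch rather than the paper's own derivation: $\mathcal{K}_1=\pi_{ref}(y_w|x)$ and $\mathcal{K}_2=\pi_{ref}(y_l|x)$ are notation introduced here for the reference-model likelihoods, which are constant with respect to $\theta$, and $\beta$ is the usual DPO temperature.

$$
\mathcal{L}_{DPO}(\mathcal{X}_1;\mathcal{X}_2)
= -\log\sigma\!\left(\beta\log\frac{\mathcal{X}_1}{\mathcal{K}_1}-\beta\log\frac{\mathcal{X}_2}{\mathcal{K}_2}\right)
= -\log\frac{(\mathcal{K}_2\mathcal{X}_1)^{\beta}}{(\mathcal{K}_2\mathcal{X}_1)^{\beta}+(\mathcal{K}_1\mathcal{X}_2)^{\beta}}
$$

$$
\frac{\partial\mathcal{L}_{DPO}}{\partial\mathcal{X}_1}
= -\frac{\beta\,(\mathcal{K}_1\mathcal{X}_2)^{\beta}}{\mathcal{X}_1\left[(\mathcal{K}_2\mathcal{X}_1)^{\beta}+(\mathcal{K}_1\mathcal{X}_2)^{\beta}\right]},
\qquad
\frac{\partial\mathcal{L}_{DPO}}{\partial\mathcal{X}_2}
= \frac{\beta\,(\mathcal{K}_1\mathcal{X}_2)^{\beta}}{\mathcal{X}_2\left[(\mathcal{K}_2\mathcal{X}_1)^{\beta}+(\mathcal{K}_1\mathcal{X}_2)^{\beta}\right]}
$$

$$
\left|\frac{\partial\mathcal{L}_{DPO}/\partial\mathcal{X}_1}{\partial\mathcal{L}_{DPO}/\partial\mathcal{X}_2}\right|
= \frac{\mathcal{X}_2}{\mathcal{X}_1}
= \frac{\pi_{\theta}(y_l|x)}{\pi_{\theta}(y_w|x)}
$$

Because a sequence likelihood is a product of per-token probabilities, it tends to shrink as the response grows longer, so this ratio depends on the relative lengths of $y_w$ and $y_l$ and not only on their quality. This is one way to read the length sensitivity described in the abstract.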