Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

Ruichen Shao, Bei Li, Gangao Liu, Yang Chen, Xiang Zhou, Jingang Wang, Xunliang Cai, Peng Li

TL;DR

This work addresses the length bias and suboptimal token-wise contributions in Direct Preference Optimization (DPO) by introducing a temporal decay factor governed by a parameter $\gamma$. The proposed method, $\textrm{D}^2\textrm{PO}$, applies exponentially decaying position weights to per-token rewards, prioritizing earlier tokens in the autoregressive sequence while keeping the loss tractable and efficient to compute. The authors provide a token-level MDP analysis and derive an upper bound on suboptimality that reveals a $\gamma$-driven trade-off with an optimal value in $(0, 1)$. Empirically, $\textrm{D}^2\textrm{PO}$ consistently outperforms vanilla DPO across multiple benchmarks and open-source LLM families, with higher win rates and reduced verbosity, and it extends to reference-free on-policy training. The approach offers a practical enhancement to preference-based fine-tuning, with a public codebase for replication.

Abstract

Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers from a length bias, generating responses longer than those from the reference model. Existing solutions like SimPO and SamPO address this issue but uniformly treat the contribution of rewards across sequences, overlooking temporal dynamics. To this end, we propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter. This dynamic weighting mechanism adjusts the influence of each reward based on its position in the sequence, prioritizing earlier tokens that are more critical for alignment. By adaptively focusing on more relevant feedback, our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences. Experimental results on several benchmarks show that our approach consistently outperforms vanilla DPO by 5.9-8.8 points on AlpacaEval 2 and 3.3-9.7 points on Arena-Hard across different model architectures and sizes. Furthermore, additional experiments on mathematical and reasoning benchmarks (MMLU, GSM8K, and MATH) confirm that our method enhances performance without compromising general capabilities. Our codebase will be available at https://github.com/LotuSrc/D2PO.
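
To make the position weighting concrete, the following is a minimal sketch of a temporally decayed DPO-style loss for a single preference pair. It assumes the decay enters as $\gamma^t$ multiplying each token's policy-to-reference log-ratio, with $t$ starting at 0 (matching the sequence 1, $\gamma$, $\gamma^2$, ... in the Figure 2 caption). The function names, the scale `beta`, and the default values are illustrative assumptions and are not taken from the authors' codebase.

```python
import math


def decayed_log_ratio(policy_logps, ref_logps, gamma):
    """Sum of per-token log-ratios, with the token at position t weighted by gamma**t.

    policy_logps / ref_logps: per-token log-probabilities of the same response
    under the policy and the reference model (hypothetical inputs).
    """
    return sum(
        (gamma ** t) * (p - r)
        for t, (p, r) in enumerate(zip(policy_logps, ref_logps))
    )


def d2po_loss(chosen_policy, chosen_ref, rejected_policy, rejected_ref,
              beta=0.1, gamma=0.98):
    """Negative log-sigmoid of the decayed reward margin for one pair.

    beta and gamma defaults are placeholders chosen for illustration only.
    """
    margin = beta * (
        decayed_log_ratio(chosen_policy, chosen_ref, gamma)
        - decayed_log_ratio(rejected_policy, rejected_ref, gamma)
    )
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Toy example: the policy prefers the chosen response more than the reference does.
loss = d2po_loss(
    chosen_policy=[-0.5, -0.7, -0.9], chosen_ref=[-1.0, -1.0, -1.0],
    rejected_policy=[-1.5, -1.2], rejected_ref=[-1.0, -1.0],
)
print(f"loss = {loss:.4f}")
```

In this sketch, setting `gamma=1.0` weights every position equally and so reduces to the vanilla DPO objective, while any `gamma` below 1 shrinks the influence of later tokens geometrically.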

Paper Structure

This paper contains 51 sections, 22 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Visualization of KL divergence of instruct models and their DPO variants. The results include three widely used open-source LLMs: Llama3, Gemma2, and Mistral-NeMo. The observation indicates that earlier tokens contribute more during alignment.
  • Figure 2: Illustration of coefficients in DPO, SimPO, SamPO, and our $\textrm{D}^2\textrm{PO}$ across various positions. Each box represents a coefficient, and the opacity denotes the magnitude, with darker colors indicating higher values. (a) For DPO, the coefficients are uniform across different positions. (b) For SimPO, the coefficients of the chosen $y_w$ and the rejected $y_l$ are normalized by their lengths $|y_w|$ and $|y_l|$, respectively. (c) In SamPO, the coefficients are selected based on the minimum length of $|y_w|$ and $|y_l|$. (d) Our method introduces a $\gamma$ factor to implement coefficient decay, specifically as a sequence defined by $\gamma^t$ (e.g., 1, $\gamma$, $\gamma^2$, ..., $\gamma^T$). Here, we use $\gamma=0.9$ for a clear visualization.
  • Figure 3: Probability against positions on 1000 samples.
  • Figure 4: Reference margin of DPO.
  • Figure 5: Performance against different $\gamma$ choices of three open-source models on three benchmarks.
  • ...and 4 more figures
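
The four coefficient schemes described in the Figure 2 caption can be sketched as per-position weight vectors. This is an illustrative reading of the caption only: the function names are hypothetical, and the SamPO variant is modelled here as a random 0/1 mask that keeps exactly $\min(|y_w|, |y_l|)$ positions, which is an assumption about how "selected based on the minimum length" is realized.

```python
import random


def dpo_coeffs(length):
    """(a) DPO: every position carries the same weight."""
    return [1.0] * length


def simpo_coeffs(length):
    """(b) SimPO: weights are normalized by the response length."""
    return [1.0 / length] * length


def sampo_coeffs(length, other_length, seed=0):
    """(c) SamPO, as assumed here: keep min(length, other_length) positions.

    The kept positions are drawn at random with a fixed seed for repeatability.
    """
    keep = min(length, other_length)
    kept = set(random.Random(seed).sample(range(length), keep))
    return [1.0 if t in kept else 0.0 for t in range(length)]


def d2po_coeffs(length, gamma=0.9):
    """(d) Temporal decay: position t gets weight gamma**t, starting at 1."""
    return [gamma ** t for t in range(length)]


# Chosen response of 5 tokens, rejected response of 3 tokens.
len_w, len_l = 5, 3
for name, coeffs in [
    ("DPO", dpo_coeffs(len_w)),
    ("SimPO", simpo_coeffs(len_w)),
    ("SamPO", sampo_coeffs(len_w, len_l)),
    ("D2PO", d2po_coeffs(len_w)),
]:
    print(name, [round(c, 3) for c in coeffs])
```

The default `gamma=0.9` mirrors the value the caption uses for visualization; it is not presented here as a recommended training setting.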