Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment

Teng Xiao, Yige Yuan, Huaisheng Zhu, Mingxiao Li, Vasant G Honavar

TL;DR

Calibrated Direct Preference Optimization (Cal-DPO) addresses a key gap in offline contrastive preference learning by calibrating the learned implicit rewards to the ground-truth reward scale, preventing degradation of the chosen-response likelihood. By adding a simple squared calibration term to the standard DPO objective, Cal-DPO aligns $\log \frac{\pi_\theta(\boldsymbol{y}|\boldsymbol{x})}{\pi_{\mathrm{ref}}(\boldsymbol{y}|\boldsymbol{x})}$ with $\frac{r(\boldsymbol{x},\boldsymbol{y})}{\beta}$, while preserving the contrastive learning signal. The approach comes with theoretical guarantees showing that Cal-DPO incorporates forward KL-like optimization and upper-bounds the RLHF objective, driving mode-seeking behavior and enabling robust alignment. Empirically, Cal-DPO yields strong improvements over DPO and other baselines across reasoning, summarization, and dialogue benchmarks, and it generalizes to calibrated variants of IPO and SLiC, with improved alignment to human preferences in both zero-shot GPT-4 evaluations and controlled reward settings. The method remains offline-only in this work, with future work suggested on extending calibration to on-policy regimes and broader safety and fairness considerations.
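The idea in the TL;DR can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' exact objective: it adds a squared calibration penalty to the standard DPO contrastive term so that each implicit reward $\log \frac{\pi_\theta(\boldsymbol{y}|\boldsymbol{x})}{\pi_{\mathrm{ref}}(\boldsymbol{y}|\boldsymbol{x})}$ is pulled toward $\frac{r(\boldsymbol{x},\boldsymbol{y})}{\beta}$. The function name, the reward targets `target_w` and `target_l` (stand-ins for the ground-truth rewards of the chosen and rejected responses), and the `calibration_weight` parameter are hypothetical and do not come from the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def calibrated_dpo_loss(
    logp_w: float,       # log pi_theta(y_w | x), chosen response
    logp_l: float,       # log pi_theta(y_l | x), rejected response
    ref_logp_w: float,   # log pi_ref(y_w | x)
    ref_logp_l: float,   # log pi_ref(y_l | x)
    target_w: float,     # ASSUMED reward target r(x, y_w); hypothetical input
    target_l: float,     # ASSUMED reward target r(x, y_l); hypothetical input
    beta: float = 1.0,
    calibration_weight: float = 1.0,  # ASSUMED weighting; not from the paper
):
    """Return (total, contrastive, calibration) for one preference pair."""
    # Implicit rewards (up to the factor beta): log-ratio of policy to reference.
    h_w = logp_w - ref_logp_w
    h_l = logp_l - ref_logp_l

    # Standard DPO contrastive term: depends only on the margin h_w - h_l.
    contrastive = -log_sigmoid(beta * (h_w - h_l))

    # Squared calibration term: pulls each log-ratio toward r / beta,
    # so the absolute values (not just the margin) are constrained.
    calibration = (h_w - target_w / beta) ** 2 + (h_l - target_l / beta) ** 2

    total = contrastive + calibration_weight * calibration
    return total, contrastive, calibration


if __name__ == "__main__":
    # Two policies with the same margin but different absolute log-ratios.
    shifted_down = calibrated_dpo_loss(-3.0, -4.0, -1.0, -1.0, 0.5, -0.5)
    calibrated = calibrated_dpo_loss(-0.5, -1.5, -1.0, -1.0, 0.5, -0.5)
    print("shifted-down policy:", shifted_down)
    print("calibrated policy:  ", calibrated)
```

In this sketch the two policies have identical contrastive terms because their margins match, so only the calibration term distinguishes the one whose chosen-response log-ratio has drifted below zero. That is the behavior the TL;DR attributes to plain DPO and that calibration is meant to prevent.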

Abstract

We study the problem of aligning large language models (LLMs) with human preference data. Contrastive preference optimization has shown promising results in aligning LLMs with available preference data by optimizing the implicit reward associated with the policy. However, the contrastive objective focuses mainly on the relative values of implicit rewards associated with two responses while ignoring their actual values, resulting in suboptimal alignment with human preferences. To address this limitation, we propose calibrated direct preference optimization (Cal-DPO), a simple yet effective algorithm. We show that substantial improvement in alignment with the given preferences can be achieved simply by calibrating the implicit reward to ensure that the learned implicit rewards are comparable in scale to the ground-truth rewards. We demonstrate the theoretical advantages of Cal-DPO over existing approaches. The results of our experiments on a variety of standard benchmarks show that Cal-DPO remarkably improves off-the-shelf methods.

Paper Structure

This paper contains 29 sections, 2 theorems, 34 equations, 6 figures, 13 tables, 1 algorithm.

Key Result

Theorem 1

Minimizing the first term of the Cal-DPO objective in Equation Eq:gener is equivalent to minimizing the forward KL divergence (equivalently, performing MLE as in Equation Eq:MLE), while maintaining the following contrastive negative gradient with respect to $\pi_{\theta}$:

Figures (6)

  • Figure 1: The implicit reward dynamics during training of DPO and Cal-DPO on UltraFeedback data with the base model Zephyr-7b-sft reveal that the rewards for rejected data continuously decrease, while the margins between chosen and rejected data keep increasing. However, in DPO, the rewards for chosen data decrease below zero, whereas in our Cal-DPO, they keep increasing and remain positive. Our Cal-DPO significantly outperforms DPO across reasoning benchmarks. More results on other datasets are provided in Section \ref{Sec:exp}.
  • Figure 2: AlpacaEval 2.0 evaluation results of models trained with UltraFeedback Binarized dataset. The DPO and Cal-DPO are both initialized from the SFT model zephyr-7b-sft-full.
  • Figure 3: (Left two) The training dynamics of DPO and Cal-DPO on the TL;DR Summarization dataset. (Right) The performance of SLiC and IPO, and their calibrated counterparts Cal-IPO and Cal-SLiC. We provide additional results on the Anthropic-HH and IMDb datasets in Appendix \ref{app:results}.
  • Figure 4: The effect of the coefficient parameter $\beta$ on four tasks. For Reddit TL;DR summarization and Anthropic-HH, we show the win rate over the chosen response.
  • Figure 5: The performance of SLiC and IPO, and their calibrated counterparts Cal-IPO and Cal-SLiC by applying our proposed calibration objective on the Anthropic-HH dataset.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • proof
  • proof