Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment
Teng Xiao, Yige Yuan, Huaisheng Zhu, Mingxiao Li, Vasant G Honavar
TL;DR
Calibrated Direct Preference Optimization (Cal-DPO) addresses a key gap in offline contrastive preference learning: the contrastive objective constrains only the difference between the implicit rewards of two responses, so the likelihood of the chosen response can degrade during training. Cal-DPO prevents this by calibrating the learned implicit rewards to the ground-truth reward scale. By adding a simple squared calibration term to the standard DPO objective, it aligns $\log \frac{\pi_\theta(\boldsymbol{y}|\boldsymbol{x})}{\pi_{\mathrm{ref}}(\boldsymbol{y}|\boldsymbol{x})}$ with $\frac{r(\boldsymbol{x},\boldsymbol{y})}{\beta}$ while preserving the contrastive learning signal. The approach comes with theoretical guarantees: Cal-DPO performs reverse-KL-like optimization and upper-bounds the RLHF objective, which encourages mode-seeking behavior and supports robust alignment. Empirically, Cal-DPO improves over DPO and other baselines across reasoning, summarization, and dialogue benchmarks, and the same calibration generalizes to variants of IPO and SLiC, with better agreement with human preferences in both zero-shot GPT-4 evaluations and controlled reward settings. The method is studied only in the offline setting in this work; extending calibration to on-policy regimes, along with broader safety and fairness considerations, is left to future work.
Abstract
We study the problem of aligning large language models (LLMs) with human preference data. Contrastive preference optimization has shown promising results in aligning LLMs with available preference data by optimizing the implicit reward associated with the policy. However, the contrastive objective focuses mainly on the relative values of the implicit rewards associated with two responses while ignoring their actual values, resulting in suboptimal alignment with human preferences. To address this limitation, we propose calibrated direct preference optimization (Cal-DPO), a simple yet effective algorithm. We show that substantial improvement in alignment with the given preferences can be achieved simply by calibrating the implicit reward so that the learned implicit rewards are comparable in scale to the ground-truth rewards. We demonstrate the theoretical advantages of Cal-DPO over existing approaches. Experiments on a variety of standard benchmarks show that Cal-DPO substantially improves on off-the-shelf methods.
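To make the calibration idea concrete, the following is a minimal per-example sketch, not the authors' reference implementation. It assumes the standard DPO term $-\log \sigma(\beta (h_w - h_l))$, where $h = \log \frac{\pi_\theta(\boldsymbol{y}|\boldsymbol{x})}{\pi_{\mathrm{ref}}(\boldsymbol{y}|\boldsymbol{x})}$ for the chosen ($w$) and rejected ($l$) responses, plus a squared penalty that pulls each $h$ toward $r/\beta$. The function name, the default reward targets of $+0.5$ and $-0.5$, and the `calibration_weight` parameter are illustrative assumptions; the exact form and weighting used in the paper may differ.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def calibrated_dpo_loss(
    policy_logp_chosen: float,
    ref_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
    reward_chosen: float = 0.5,      # assumed target reward for the chosen response
    reward_rejected: float = -0.5,   # assumed target reward for the rejected response
    calibration_weight: float = 1.0, # hypothetical knob, not taken from the source
) -> float:
    """Per-example loss: DPO contrastive term plus a squared calibration term."""
    # Log-ratios between the policy and the reference model.
    h_chosen = policy_logp_chosen - ref_logp_chosen
    h_rejected = policy_logp_rejected - ref_logp_rejected

    # Standard DPO term: depends only on the difference h_chosen - h_rejected.
    contrastive = -log_sigmoid(beta * (h_chosen - h_rejected))

    # Calibration term: pulls each log-ratio toward r / beta,
    # so the absolute values matter, not just their gap.
    calibration = (h_chosen - reward_chosen / beta) ** 2 \
        + (h_rejected - reward_rejected / beta) ** 2

    return contrastive + calibration_weight * calibration


# Two examples with the same gap (h_chosen - h_rejected = 10), hence the same DPO term.
calibrated = calibrated_dpo_loss(-10.0, -15.0, -20.0, -15.0)  # h = +5, -5
shifted = calibrated_dpo_loss(-13.0, -15.0, -23.0, -15.0)     # h = +2, -8
```

In this sketch both examples have the same contrastive loss because their log-ratio gaps are equal, but the second one, where the chosen-response log-ratio has drifted downward, is penalized by the calibration term. This is the behavior that the contrastive objective alone cannot distinguish.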
