Uncertainty-Penalized Direct Preference Optimization

Sam Houliston, Alizée Pace, Alexander Immer, Gunnar Rätsch

TL;DR

We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes inspired by offline reinforcement learning. The framework improves overall performance over vanilla DPO and yields better completions on prompts whose chosen/rejected responses have high uncertainty.

Abstract

Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to the issue of proxy reward overoptimization. Analysis of the DPO loss reveals a critical need for regularization for mislabeled or ambiguous preference pairs to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes, inspired by offline reinforcement learning. The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples. Evaluation of the methods is performed with GPT2 Medium on the Anthropic-HH dataset using a model ensemble to obtain uncertainty estimates, and shows improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
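
To make the idea of attenuating the loss gradient for uncertain samples concrete, the following is a minimal sketch, not the paper's exact scheme. It assumes the per-pair DPO loss is $-\log\sigma(A_\theta)$, where $A_\theta$ is the scaled implicit reward margin between the chosen and rejected response, and it uses a hypothetical multiplicative weight $1/(1+\lambda u)$ on an uncertainty estimate $u$ (for instance, the disagreement of a reward model ensemble). The names `margin`, `uncertainty`, and `lam`, as well as the specific form of the weight, are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dpo_loss(margin: float) -> float:
    """Vanilla per-pair DPO loss, -log(sigmoid(margin)).

    `margin` stands for A_theta: the scaled implicit reward of the chosen
    response minus that of the rejected response.
    """
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed stably for both signs.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def dpo_grad(margin: float) -> float:
    """Derivative of the vanilla loss with respect to the margin."""
    return -sigmoid(-margin)


def penalty_weight(uncertainty: float, lam: float = 1.0) -> float:
    """Hypothetical multiplicative penalty: shrinks toward 0 as uncertainty grows."""
    return 1.0 / (1.0 + lam * uncertainty)


def penalized_dpo_loss(margin: float, uncertainty: float, lam: float = 1.0) -> float:
    """Illustrative uncertainty-penalized loss (assumed form, not the paper's)."""
    return penalty_weight(uncertainty, lam) * dpo_loss(margin)


def penalized_dpo_grad(margin: float, uncertainty: float, lam: float = 1.0) -> float:
    """Gradient of the penalized loss; it is the vanilla gradient scaled down."""
    return penalty_weight(uncertainty, lam) * dpo_grad(margin)
```

Under these assumptions, a pair with zero uncertainty recovers vanilla DPO, while a pair with `uncertainty=1.0` and `lam=1.0` contributes half the gradient, so a mislabeled or ambiguous pair pulls the policy less strongly.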

Paper Structure

This paper contains 58 sections, 41 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: $\mathcal{L}_{\text{DPO}}$ vs $A_\theta$
  • Figure 2: Rewards over completions for 500 Anthropic-HH test prompts.
  • Figure 3: Study of robustness to reward overoptimization.
  • Figure 4: Statistics of reward scores by the model ensemble on Anthropic-HH test dataset.
  • Figure 5: Model completion scores on 500 Anthropic-HH test prompts. The dataset's chosen response obtains the highest score, followed by the multiplication penalty scheme. The improvement in scores from Pretrained to SFT to DPO Baseline confirms that the DPO baseline was trained validly.
  • ...and 1 more figure