Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator

Zhuotong Chen, Fang Liu, Xuan Zhu, Yanjun Qi, Mohammad Ghavamzadeh

TL;DR

This work reframes preference optimization as negative log-likelihood (NLL) estimation and uses a Monte Carlo (MC) kernel from contrastive divergence (CD) to sample dispreferred completions. It introduces MC-PO as an offline algorithm and OnMC-PO as its online extension; both use the CD-based MC kernel to sample hard negatives with probability proportional to the current reward model. The authors prove that, in the online setting, sampling the preferred completion from the target policy yields an unbiased gradient estimator for the normalization constant, and they report better performance than state-of-the-art PO methods on standard alignment benchmarks. CD-based sampling increases training time, but the results show meaningful gains in model alignment. Future work targets multi-step MCMC and efficiency improvements that reduce this overhead.

Abstract

Existing studies on preference optimization (PO) have centered on constructing pairwise preference data following simple heuristics, such as maximizing the margin between preferred and dispreferred completions based on human (or AI) ranked scores. However, none of these heuristics has a full theoretical justification. In this work, we develop a novel PO framework that provides theoretical guidance to effectively sample dispreferred completions. To achieve this, we formulate PO as minimizing the negative log-likelihood (NLL) of a probability model and propose to estimate its normalization constant via a sampling strategy. As we will demonstrate, the samples used for this estimate can act as dispreferred completions in PO. We then select contrastive divergence (CD) as the sampling strategy, and propose a novel MC-PO algorithm that applies the Monte Carlo (MC) kernel from CD to sample hard negatives w.r.t. the parameterized reward model. Finally, we propose the OnMC-PO algorithm, an extension of MC-PO to the online setting. On popular alignment benchmarks, MC-PO outperforms existing SOTA baselines, and OnMC-PO leads to further improvement.
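
As a rough illustration of why estimating the normalization constant calls for sampled negatives, consider a generic energy-based probability model. The symbols $r_\theta$ (a parameterized reward) and $Z_\theta$ (its normalization constant) are notation assumed for this sketch; the paper's exact parameterization may differ.

```latex
% Assumed generic form: a reward-defined distribution over completions
p_\theta(\mathbf{y} \mid \mathbf{x})
  = \frac{\exp\big(r_\theta(\mathbf{x}, \mathbf{y})\big)}{Z_\theta(\mathbf{x})},
\qquad
Z_\theta(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\big(r_\theta(\mathbf{x}, \mathbf{y}')\big)

% NLL of a preferred completion y_0
-\log p_\theta(\mathbf{y}_0 \mid \mathbf{x})
  = -r_\theta(\mathbf{x}, \mathbf{y}_0) + \log Z_\theta(\mathbf{x})

% Its gradient: the second term is an expectation under the model itself
\nabla_\theta \big[-\log p_\theta(\mathbf{y}_0 \mid \mathbf{x})\big]
  = -\nabla_\theta r_\theta(\mathbf{x}, \mathbf{y}_0)
  + \mathbb{E}_{\mathbf{y} \sim p_\theta(\cdot \mid \mathbf{x})}
    \big[\nabla_\theta r_\theta(\mathbf{x}, \mathbf{y})\big]
```

The expectation in the last line cannot be computed over all completions, so it has to be approximated with samples. Gradient descent lowers the reward of those samples, which is the role dispreferred completions play in PO.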

Paper Structure

This paper contains 42 sections, 9 theorems, 44 equations, 2 figures, 4 tables, 1 algorithm.

Key Result

Proposition 2.0

Suppose that we have $\mathbf{y}_0 \sim \pi^{\ast}(\mathbf{y} | \mathbf{x})$, and $M$ noisy samples $\{\mathbf{y}_i\}_{i=1}^M$, where each $\mathbf{y}_i$ is sampled from a proposal distribution, $\mathbf{y}_i \sim \mu(\mathbf{y} | \mathbf{x})$. Then RNCE approximates the NLL estimation as follows,

Figures (2)

  • Figure 1: Left: existing studies choose the dispreferred completion that maximizes the gap with the preferred completion based on human (or AI) ranked scores. Right: we propose theoretical guidance to sample dispreferred completion(s) proportionally to the parameterized reward model. As the parameters evolve during training, the sampling of dispreferred completions changes.
  • Figure 2: Win-rate evaluation of the https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT model optimized with MC-PO versus its Max- and Min-sampling variants. Five modified versions of the https://huggingface.co/datasets/berkeley-nest/Nectar dataset are used for training. "$x$ negs" denotes a training dataset that contains $x$ negative candidates for each input prompt. For example, the $3$ negs dataset is constructed by removing the rank-$2$ and rank-$3$ completions from the https://huggingface.co/datasets/berkeley-nest/Nectar dataset.
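
The selection strategies contrasted in the two figures can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `softmax` and `pick_negative` are hypothetical, the rewards are plain floats standing in for the parameterized reward model's scores, and reading "Max" and "Min" as argmax and argmin over those scores is an assumption based on the Figure 2 caption.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_negative(rewards, strategy="mc", rng=random):
    """Return the index of the candidate to use as the dispreferred completion.

    rewards  : one score per negative candidate (stand-in for the reward model)
    strategy : "mc"  -> sample with probability proportional to exp(reward)
               "max" -> always take the highest-reward candidate (assumed reading)
               "min" -> always take the lowest-reward candidate (assumed reading)
    """
    indices = range(len(rewards))
    if strategy == "max":
        return max(indices, key=rewards.__getitem__)
    if strategy == "min":
        return min(indices, key=rewards.__getitem__)
    if strategy == "mc":
        probs = softmax(rewards)
        return rng.choices(indices, weights=probs, k=1)[0]
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    candidate_rewards = [0.1, 2.0, -1.0]  # made-up scores for three candidates
    rng = random.Random(0)
    print("max:", pick_negative(candidate_rewards, "max"))
    print("min:", pick_negative(candidate_rewards, "min"))
    print("mc :", [pick_negative(candidate_rewards, "mc", rng) for _ in range(10)])
```

Because the "mc" branch recomputes probabilities from the current scores on every call, the chosen negative shifts as the reward model's parameters change during training, which matches the behavior described for the right panel of Figure 1.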

Theorems & Definitions (14)

  • Proposition 2.0
  • Lemma 3.0
  • Lemma 3.0
  • Proposition 3.0
  • Proposition 1.0
  • proof
  • Lemma 1.0
  • proof
  • Lemma 1.0
  • proof
  • ...and 4 more