Design Considerations in Offline Preference-based RL

Alekh Agarwal, Christoph Dann, Teodor V. Marinov

TL;DR

The paper analyzes offline preference-based RLHF methods through a unified theoretical framework, showing that the quality of the learned policy hinges on loss curvature, data coverage, and the choice of base policy rather than on reparameterization arguments alone. It introduces a benchmark policy $\pi^*$ and derives a KL bound linking empirical loss to policy proximity under realizability and proper-loss assumptions; importantly, squared losses with favorable curvature yield stronger guarantees than logistic losses in this setting. Empirically, squared-loss variants (IPO-style) outperform logistic-loss variants (DPO-style) on the TL;DR summarization benchmark, and the choice of base policy (e.g., using a reference policy) interacts with stability and performance. The findings suggest design guidance for offline RLHF: prefer well-curved losses, account for coverage, and design data collection to improve support for high-quality responses, offering a theoretical foundation beyond reparameterization-based arguments.

Abstract

Offline algorithms for Reinforcement Learning from Human Preferences (RLHF), which use only a fixed dataset of sampled responses given an input, and preference feedback among these responses, have gained increasing prominence in the literature on aligning language models. In this paper, we study how the different design choices made in methods such as DPO, IPO, SLiC and many variants influence the quality of the learned policy, from a theoretical perspective. Our treatment yields insights into the choices of loss function, the policy which is used to normalize log-likelihoods, and also the role of the data sampling policy. Notably, our results do not rely on the standard reparameterization-style arguments used to motivate some of the algorithms in this family, which allows us to give a unified treatment to a broad class of methods. We also conduct a small empirical study to verify some of the theoretical findings on a standard summarization benchmark.
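To make the loss-function design choice more concrete, the following minimal Python sketch compares the second derivative (curvature) of a logistic loss and a squared loss as a function of a scalar margin `m`. The margin is assumed here to be the scaled difference of normalized log-likelihoods between the preferred and dispreferred response; the function names are hypothetical, and the paper's formal curvature assumption may be stated differently.

```python
import math


def sigmoid(m: float) -> float:
    """Numerically stable logistic function."""
    if m >= 0:
        return 1.0 / (1.0 + math.exp(-m))
    e = math.exp(m)
    return e / (1.0 + e)


def logistic_curvature(m: float) -> float:
    """Second derivative of log(1 + exp(-m)) with respect to m."""
    s = sigmoid(m)
    return s * (1.0 - s)


def squared_curvature(m: float) -> float:
    """Second derivative of (m - c)^2, which is 2 for any constant target c."""
    return 2.0


# Illustrative margins only; these are not values from the paper.
for m in (0.0, 2.0, 5.0, 10.0):
    print(f"m={m:5.1f}  logistic={logistic_curvature(m):.6f}  "
          f"squared={squared_curvature(m):.1f}")
```

Under these assumptions, the logistic curvature is at most 0.25 and decays toward zero as the margin grows, whereas the squared loss keeps constant curvature everywhere.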

Paper Structure

This paper contains 17 sections, 4 theorems, 26 equations, 2 figures, 2 tables.

Key Result

Theorem 3.6

For any $\pi \in \Pi$ such that $L_\mu(\pi; D_{xy\omega}) - L_\mu(\pi^\star_\mu; D_{xy\omega}) \leq \epsilon$, where the loss $\ell$ underlying $L_\mu$ is proper, and under the assumptions labeled ass:uniform_bound through ass:curvature in the paper, it holds that

Figures (2)

  • Figure 1: Left panel shows the preference for the learned policy's summaries over those from the initial policy $\pi_{\text{ref}}$, as evaluated by a prompted Gemini 1.0 Ultra model. Shaded regions represent 95% error bands. Both logistic loss variants improve quickly in preference score at first, but then suffer a catastrophic collapse. Squared loss improves at a similar rate initially and remains stable throughout training. Right panel shows a direct comparison between the logistic loss variants using $\mu =$ uniform and $\mu = \pi_{\text{ref}}$ (DPO) at regular intervals during training. Interestingly, the uniform variant is preferred in the early stages of training, but as training collapses around step 5K, the $\pi_{\text{ref}}$ variant starts to improve. Nevertheless, the absolute performance of both variants peaks earlier in training and rapidly worsens after 5K steps, suggesting that the preference for $\pi_{\text{ref}}$ over uniform in this region might not be particularly significant. See text for a more nuanced discussion.
  • Figure 2: Evolution of the log-likelihoods of the preferred response (left) and dispreferred response (right) from the preference dataset across the training process. Both variants of the squared loss decrease the log-likelihoods of both responses during training, but the decrease is relatively mild. The logistic loss, on the other hand, sends these log-likelihoods crashing sharply, even though the dispreferred responses have significantly lower values, so the difference of log-likelihoods remains highly negative, driving the loss to zero. We suspect that this degeneration of log-likelihoods is responsible for the eventual collapse observed for the logistic loss in Figure 1.
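
The behavior described in the Figure 2 caption can be illustrated with a small Python sketch. It assumes a pairwise loss that depends only on a margin, taken here to be the scaled difference of base-normalized log-likelihoods of the preferred and dispreferred responses. The helper names, the `beta` scale, the squared-loss `target`, and all numeric values are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math


def margin(logp_w, logp_l, logmu_w=0.0, logmu_l=0.0, beta=1.0):
    """Scaled difference of base-normalized log-likelihoods (assumed form).

    With a uniform base policy, logmu_w == logmu_l and the base terms cancel.
    """
    return beta * ((logp_w - logmu_w) - (logp_l - logmu_l))


def logistic_loss(m):
    """DPO-style loss log(1 + exp(-m)), computed stably."""
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


def squared_loss(m, target=1.0):
    """IPO-style loss (m - target)^2 with a hypothetical target margin."""
    return (m - target) ** 2


# Lowering both log-likelihoods by the same amount leaves the margin unchanged,
# so a margin-based loss does not penalize both responses becoming less likely.
logp_w, logp_l = -50.0, -60.0
shifted_margins = [margin(logp_w - s, logp_l - s) for s in (0.0, 100.0, 1000.0)]
print(shifted_margins)

# As the margin grows, the logistic loss keeps shrinking toward zero, while the
# squared loss increases once the margin overshoots its target.
growing = [1.0, 5.0, 20.0]
logistic_values = [logistic_loss(m) for m in growing]
squared_values = [squared_loss(m) for m in growing]
print(logistic_values)
print(squared_values)
```

In this sketch the logistic loss always rewards a larger margin, which is consistent with the log-likelihoods drifting far downward while the loss approaches zero; the squared loss has a finite minimizer, which discourages that drift.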

Theorems & Definitions (11)

  • Theorem 3.6
  • Remark 3.7: Choice of loss function
  • Remark 3.8: Choice of base policy
  • Remark 3.9: Effect of constraints
  • Remark 3.10: Connections with prior results
  • Lemma 4.1
  • proof
  • Lemma 4.2
  • proof
  • Lemma 4.3
  • ...and 1 more