The Importance of Online Data: Understanding Preference Fine-tuning via Coverage

Yuda Song, Gokul Swamy, Aarti Singh, J. Andrew Bagnell, Wen Sun

TL;DR

The paper provides a principled, coverage-based separation between online RLHF and offline contrastive preference-fine-tuning methods, showing that global coverage is necessary for offline methods to reach the optimal policy while online methods only require a weaker, KL-based local coverage. It introduces HyPO, a hybrid optimizer that uses offline DPO objectives with online KL regularization, achieving better performance and lower reverse KL than purely offline approaches. The work also demonstrates that online methods can guarantee performance under partial coverage when reward functions are bounded, and analyzes extrapolation behavior under function approximation, highlighting the importance of FA for generalization beyond the training data. Collectively, these results guide data collection and algorithm design for robust, scalable preference fine-tuning of LLMs.

Abstract

Learning from human preference data has emerged as the dominant paradigm for fine-tuning large language models (LLMs). The two most common families of techniques -- online reinforcement learning (RL) such as Proximal Policy Optimization (PPO) and offline contrastive methods such as Direct Preference Optimization (DPO) -- were positioned as equivalent in prior work due to the fact that both have to start from the same offline preference dataset. To further expand our theoretical understanding of the similarities and differences between online and offline techniques for preference fine-tuning, we conduct a rigorous analysis through the lens of dataset coverage, a concept that captures how the training data covers the test distribution and is widely used in RL. We prove that a global coverage condition is both necessary and sufficient for offline contrastive methods to converge to the optimal policy, but a weaker partial coverage condition suffices for online RL methods. This separation provides one explanation of why online RL methods can perform better than offline methods, especially when the offline preference data is not diverse enough. Finally, motivated by our preceding theoretical observations, we derive a hybrid preference optimization (HyPO) algorithm that uses offline data for contrastive-based preference optimization and online data for KL regularization. Theoretically and empirically, we demonstrate that HyPO is more performant than its pure offline counterpart DPO, while still preserving its computation and memory efficiency.
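The hybrid idea in the abstract (a contrastive loss on offline preference pairs plus a KL term estimated from online samples) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the weight `lam`, the default `beta`, and the use of a plain Monte Carlo KL estimate are all assumptions, and the gradient estimator the paper uses for the online term is not shown in this summary.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO loss on offline pairs (w = preferred, l = rejected).
    # Inputs are per-example sequence log-probabilities.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(-np.mean(log_sigmoid(margin)))

def reverse_kl_estimate(logp_online, ref_logp_online):
    # Monte Carlo estimate of KL(pi || pi_ref) using responses
    # sampled from the current policy pi (the "online" data).
    return float(np.mean(logp_online - ref_logp_online))

def hypo_objective(offline, online, beta=0.1, lam=0.01):
    # Hypothetical combination: offline DPO loss + lam * online reverse KL.
    # `offline` = (logp_w, logp_l, ref_logp_w, ref_logp_l)
    # `online`  = (logp_online, ref_logp_online)
    return dpo_loss(*offline, beta=beta) + lam * reverse_kl_estimate(*online)
```

When the policy equals the reference policy, the DPO margin and the KL estimate are both zero, so the objective reduces to $\log 2$. The online term needs only samples from the current policy and their log-probabilities under the two models, with no reward model involved.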

Paper Structure (36 sections, 13 theorems, 52 equations, 2 figures, 9 tables, 1 algorithm)


Key Result

proposition 1

Let $\pi_{\mathsf{ref}}$ be any reference policy for which the global coverage assumption (assump:global) fails. Let $\Pi_{\textsf{dpo}}$ be the set of policies returned by DPO for which the reward-learning assumption (assump:reward_learning) holds. Then there exists a policy $\pi \in \Pi_{\textsf{dpo}}$ such that $J(\pi) = -\infty$.
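A small tabular example may help show how an objective value of $-\infty$ can arise. The sketch below assumes $J$ is the usual KL-regularized objective, $J(\pi) = \mathbb{E}_{y \sim \pi}[r(y)] - \beta\,\mathsf{KL}(\pi \,\|\, \pi_{\mathsf{ref}})$, which this summary does not define explicitly. The three-response setup, the reward values, and the function names are made up for illustration.

```python
import math

def reverse_kl(pi, pi_ref):
    # KL(pi || pi_ref) for discrete distributions given as lists.
    total = 0.0
    for p, q in zip(pi, pi_ref):
        if p == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if q == 0.0:
            return math.inf  # pi puts mass where pi_ref has none
        total += p * math.log(p / q)
    return total

def objective(pi, pi_ref, rewards, beta=0.1):
    # Assumed form: expected reward minus beta times the reverse KL.
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward - beta * reverse_kl(pi, pi_ref)

# Hypothetical setup: the reference policy never produces response 2.
pi_ref = [0.5, 0.5, 0.0]
rewards = [0.0, 1.0, 0.5]

covered = [0.2, 0.8, 0.0]    # stays inside the support of pi_ref
uncovered = [0.2, 0.7, 0.1]  # leaks mass onto the unsupported response

j_covered = objective(covered, pi_ref, rewards)
j_uncovered = objective(uncovered, pi_ref, rewards)
```

Here `j_covered` is finite, while `j_uncovered` is `-inf`: any mass placed outside the support of the reference policy makes the KL term infinite, whatever the (bounded) reward. An offline contrastive loss only constrains the policy on responses that appear in the data, which is consistent with the proposition's claim that some DPO solution can behave this way when global coverage fails.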

Figures (2)

  • Figure 1: Mean validation reverse KL to the reference policy when DPO and HyPO are trained for 5 epochs on the TL;DR dataset. We repeat the experiment for 3 random seeds and plot the median; the shaded areas denote the min and max over the 3 repetitions.
  • Figure 2: Left and middle: Extrapolation behavior of the online RL method and DPO under linear function approximation (FA). We plot the mean log probability of the preferred responses and the log probability of the best response, which is unseen in the training data. Both algorithms correctly assign increasing probability to the best response. Right: Extrapolation behavior of DPO without function approximation. We plot the average probability of out-of-distribution responses over the course of training; DPO assigns increasing probability to out-of-distribution responses.

Theorems & Definitions (15)

  • remark 1
  • proposition 1
  • proposition 2: Informal
  • remark 2
  • theorem 1
  • lemma 1
  • theorem 2
  • lemma 2: Objective decomposition
  • lemma 3: Lemma C.2 from chang2024dataset
  • proposition 3: Formal version of prop:ipo_partial
  • ...and 5 more