Table of Contents
Fetching ...

Preference Alignment with Flow Matching

Minu Kim, Yongsik Lee, Sehyeok Kang, Jihwan Oh, Song Chong, Se-Young Yun

TL;DR

Preference Flow Matching is presented, a new framework for preference-based reinforcement learning (PbRL) that streamlines the integration of preferences into an arbitrary class of pre-trained models and effectively aligns model outputs with human preferences without relying on explicit or implicit reward function estimation.

Abstract

We present Preference Flow Matching (PFM), a new framework for preference-based reinforcement learning (PbRL) that streamlines the integration of preferences into an arbitrary class of pre-trained models. Existing PbRL methods require fine-tuning pre-trained models, which presents challenges such as scalability, inefficiency, and the need for model modifications, especially with black-box APIs like GPT-4. In contrast, PFM utilizes flow matching techniques to directly learn from preference data, thereby reducing the dependency on extensive fine-tuning of pre-trained models. By leveraging flow-based models, PFM transforms less preferred data into preferred outcomes, and effectively aligns model outputs with human preferences without relying on explicit or implicit reward function estimation, thus avoiding common issues like overfitting in reward models. We provide theoretical insights that support our method's alignment with standard PbRL objectives. Experimental results indicate the practical effectiveness of our method, offering a new direction in aligning a pre-trained model to preference. Our code is available at https://github.com/jadehaus/preference-flow-matching.

Preference Alignment with Flow Matching

TL;DR

Preference Flow Matching is presented, a new framework for preference-based reinforcement learning (PbRL) that streamlines the integration of preferences into an arbitrary class of pre-trained models and effectively aligns model outputs with human preferences without relying on explicit or implicit reward function estimation.

Abstract

We present Preference Flow Matching (PFM), a new framework for preference-based reinforcement learning (PbRL) that streamlines the integration of preferences into an arbitrary class of pre-trained models. Existing PbRL methods require fine-tuning pre-trained models, which presents challenges such as scalability, inefficiency, and the need for model modifications, especially with black-box APIs like GPT-4. In contrast, PFM utilizes flow matching techniques to directly learn from preference data, thereby reducing the dependency on extensive fine-tuning of pre-trained models. By leveraging flow-based models, PFM transforms less preferred data into preferred outcomes, and effectively aligns model outputs with human preferences without relying on explicit or implicit reward function estimation, thus avoiding common issues like overfitting in reward models. We provide theoretical insights that support our method's alignment with standard PbRL objectives. Experimental results indicate the practical effectiveness of our method, offering a new direction in aligning a pre-trained model to preference. Our code is available at https://github.com/jadehaus/preference-flow-matching.
Paper Structure (24 sections, 4 theorems, 24 equations, 5 figures, 6 tables, 1 algorithm)

This paper contains 24 sections, 4 theorems, 24 equations, 5 figures, 6 tables, 1 algorithm.

Key Result

Theorem 4.1

Let $p_{1}$ denote the marginal distribution of the positively-labeled samples $y^{+}$. Then the marginal distribution $p_{1}$ obtained from the preference model $\mathbb{P}(y > y' | x)$ is an optimizer of the optimization problem

Figures (5)

  • Figure 1: Illustration of our PFM framework. In the typical RLHF scenarios (left), we first sample preference data from the supervised fine-tuned (SFT) reference model. A reward model is learned from the collected dataset, either implicitly (as in DPO) or explicitly. The reward model is then used to fine-tune the reference policy to obtain the final model. Our method directly learns the preference flow from the collected preference data, where the flow is represented as a vector field $v_{\theta}$ (middle). For inference, we again sample a point from the reference policy, and improve the quality of alignment by using the trained flow matching model, without the need of fine-tuning the existing reference model (right).
  • Figure 2: Comparison of RLHF, DPO, and PFM on a 2-dimensional toy experiment. We generate preference labels from a ground truth reward in (a) and a pre-trained Gaussian reference policy (c). Both the RLHF (e) and DPO (f) methods struggle to align with the preferences, due to the overfitted reward model (b), even with the presence of KL regularizer ($\beta = 1$). PFM is able to mimic the distribution of the positively-labeled samples (d), and therefore achieves the highest performance (g). Repeating PFM iteratively to the marginal samples can further improve the alignment with the preference (h).
  • Figure 3: Comparison of RLHF, DPO, and PFM on a conditional MNIST image generation task. Numbers represent the preference score. PFM (d) demonstrates superior sample quality and preference alignment compared to RLHF (e) and DPO (f), where DPO collapses with a small size of $\beta$ (g). The iterative PFM with only two iterations (h) results in almost perfectly aligning with the preferences.
  • Figure 4: Distribution of preference scores for each method. Left visualizes the distribution of scores for the pre-trained reference policy and PFM-attached policy. Without fine-tuning the reference policy, PFM can obtain substantially better results by only adding a small flow-matching module. Right visualizes the preference score distribution of the RLHF (PPO) fine-tuned policy, and the PFM added policy to the PPO fine-tuned policy. Note that PFM is trained with the original dataset, not by the dataset generated from the PPO fine-tuned policy.
  • Figure 5: Analysis of a sample episode of a DPO fine-tuned model on the MuJoCo ant environment. DPO fine-tuned model often overestimates the reward due to reward overfitting (e.g., $t=196$). This can cause the policy to choose problematic actions. Here, the implicit reward estimation is $\hat{r}_{\theta}(s, a) = \beta\log(\pi_{\theta}(a|s) / \pi_{\mathrm{ref}}(a|s))$.

Theorems & Definitions (6)

  • Theorem 4.1: Characterization of the Marginal
  • Theorem 4.2: Convergence of Iterative Method
  • Theorem C.1: Characterization of the Marginal
  • proof
  • Theorem C.2: Convergence of Iterative Method
  • proof