VPO: Leveraging the Number of Votes in Preference Optimization

Jae Hyeon Cho, Minkyung Park, Byung-Jun Lee

TL;DR

This paper introduces the Vote-based Preference Optimization (VPO) framework, which incorporates the number of votes on both sides to distinguish between controversial and clearly preferred generation pairs, and demonstrates that previous algorithms can be extended using the proposed framework, termed VDPO and VIPO.

Abstract

Direct Preference Optimization (DPO) trains a language model using human preference data, bypassing the explicit reward modeling phase of Reinforcement Learning from Human Feedback (RLHF). By iterating over sentence pairs in a preference dataset, DPO enhances generation quality by increasing the likelihood of producing preferred sentences over less favored ones. Preference datasets are typically created by selecting preferred sentences through a voting process involving multiple individuals, as opinions can vary due to the subjective nature of human preferences. While the number of votes offers insight into whether a sentence pair is clearly preferable or controversial, current methods do not fully leverage this information. In this paper, we introduce a technique that leverages user voting data to better align with diverse subjective preferences. We employ the Bayesian Minimum Mean Square Error (Bayesian MMSE) estimator to model the probability that one generation is preferable to another. Using this estimated probability as a target, we develop the Vote-based Preference Optimization (VPO) framework, which incorporates the number of votes on both sides to distinguish between controversial and obvious generation pairs. We show that previous algorithms, such as DPO and Identity Preference Optimization (IPO), can be extended using the proposed framework, termed VDPO and VIPO. Our experiments demonstrate that these proposed algorithms outperform various existing methods, including their base algorithms.
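The vote-to-target step described above can be illustrated with a minimal sketch. It assumes a symmetric Beta(c, c) prior over the preference probability and binomial votes, under which the Bayesian MMSE estimate is the posterior mean; the prior, the hyperparameter `c`, and the function names `vote_target` and `soft_preference_loss` are assumptions made for illustration and are not taken from the paper. The loss shown is a generic cross-entropy against a soft target, not necessarily the exact VDPO objective.

```python
import math


def vote_target(votes_a: int, votes_b: int, c: float = 1.0) -> float:
    """Posterior-mean (Bayesian MMSE) estimate of P(A is preferred over B).

    Assumes a symmetric Beta(c, c) prior and binomial votes. The value of
    c is a hypothetical hyperparameter, not one reported by the paper.
    """
    if votes_a < 0 or votes_b < 0 or c <= 0:
        raise ValueError("votes must be non-negative and c must be positive")
    return (votes_a + c) / (votes_a + votes_b + 2 * c)


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def soft_preference_loss(margin: float, target: float) -> float:
    """Cross-entropy between sigmoid(margin) and a soft target in [0, 1].

    `margin` stands in for the scaled implicit reward difference between
    the two generations. With target == 1.0 this reduces to the familiar
    -log(sigmoid(margin)) form.
    """
    log_p = -_softplus(-margin)      # log sigmoid(margin)
    log_not_p = -_softplus(margin)   # log (1 - sigmoid(margin))
    return -(target * log_p + (1.0 - target) * log_not_p)


if __name__ == "__main__":
    print(vote_target(9, 1))   # clearly preferred pair: 10/12
    print(vote_target(5, 5))   # controversial pair: 0.5
    print(soft_preference_loss(0.0, vote_target(5, 5)))  # log(2)
```

Under these assumptions, a 9-to-1 vote yields a target of about 0.83 while a 5-to-5 vote yields 0.5, so a controversial pair pulls the model toward indifference rather than toward a hard preference.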

Paper Structure

This paper contains 42 sections, 1 theorem, 14 equations, 2 figures, 12 tables.

Key Result

Theorem 1

(Pishro-Nik, 2014) The Bayesian MMSE estimator is the solution to the following:
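
For reference, the standard textbook statement of the Bayesian MMSE estimator is sketched here. The notation (an unknown parameter \theta, observed data X, and a candidate estimate \hat{\theta}) is assumed for illustration and may differ from the paper's:

```latex
\hat{\theta}_{\mathrm{MMSE}}(x)
  = \arg\min_{\hat{\theta}} \;
    \mathbb{E}\!\left[ (\theta - \hat{\theta})^{2} \,\middle|\, X = x \right]
  = \mathbb{E}\!\left[ \theta \mid X = x \right]
```

In words, the estimate that minimizes the expected squared error given the observations is the posterior mean of the parameter.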

Figures (2)

  • Figure 1: While previous methods trained models to generate responses based on majority preference (e.g., A), human preferences are subjective, making responses like B also desirable. Our proposed framework, VPO, utilizes additional information to capture a more nuanced understanding of these preferences.
  • Figure 2: This figure illustrates the reward margin between preferred and non-preferred responses during the preference alignment of the LLaMA 7B model using four different algorithms on the SHP dataset.

Theorems & Definitions (1)

  • Theorem 1