LRHP: Learning Representations for Human Preferences via Preference Pairs

Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Qiaozhi He, Murun Yang, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu

TL;DR

This work introduces a preference representation learning task that aims to construct a richer and more structured representation of human preferences. It also develops a more generalizable framework, Learning Representations for Human Preferences via preference pairs (LRHP), which extends beyond traditional reward modeling to tackle this task.

Abstract

To improve human-preference alignment training, current research has developed numerous preference datasets consisting of preference pairs labeled as "preferred" or "dispreferred". These preference pairs are typically used to encode human preferences into a single numerical value through reward modeling, which acts as a reward signal during reinforcement learning from human feedback (RLHF). However, representing human preferences as a single numerical value complicates their analysis and restricts their use in applications beyond RLHF. In contrast, in this work, we introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences. We further develop a more generalizable framework, Learning Representations for Human Preferences via preference pairs (LRHP), which extends beyond traditional reward modeling to tackle this task. We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction. Building on the human preferences captured in these representations, we achieve strong performance in both tasks, significantly outperforming baselines.
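
For readers unfamiliar with the reward-modeling baseline the abstract contrasts against, the following is a minimal sketch of how a preference pair is usually reduced to a single scalar signal. It assumes the widely used Bradley-Terry pairwise objective, which this page does not itself specify; the function names `pairwise_reward_loss` and `preference_probability` are hypothetical and are not taken from the paper.

```python
import math


def pairwise_reward_loss(r_preferred: float, r_dispreferred: float) -> float:
    """Negative log-likelihood of one preference pair under a Bradley-Terry model.

    Assumption: the reward model has already mapped each response to a scalar
    reward; this function only scores the pair.
    """
    margin = r_preferred - r_dispreferred
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(r_preferred: float, r_dispreferred: float) -> float:
    """Modeled probability that the 'preferred' response wins: sigmoid(margin)."""
    return math.exp(-pairwise_reward_loss(r_preferred, r_dispreferred))


if __name__ == "__main__":
    # Hypothetical scalar rewards for one preference pair.
    print(pairwise_reward_loss(1.5, 0.2))
    print(preference_probability(1.5, 0.2))
```

Under this assumed objective, everything the model learns about a pair is collapsed into the difference of two scalars, which is the limitation the abstract points to when it argues for a richer preference representation.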

Paper Structure

This paper contains 50 sections, 5 equations, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: The overview of LRHP. The left panel illustrates learning representations for human preferences via comparison pairs. The right panel shows two downstream tasks designed to verify these preference representations.
  • Figure 2: To improve the preference-specific reward model using preference data selection (PDS), we first pre-train the model on the selected preference data and then fine-tune it with a limited amount of preference-specific data. This strategy allows the reward model to inherit the same preferences and reduces the need for preference-specific data.
  • Figure 3: We report the performance of the PDS task using the LLaMA-3-8B-Instruction model, with the helpfulness win rate on the x-axis and the harmlessness win rate on the y-axis. Experiments are conducted using three different seeds, and we report the average results along with their standard deviations.
  • Figure 4: We also report the performance of the PDS task using the Mistral-7B-Instruction model.
  • Figure 5: Comparison of PMP-SFT and PMP-LRHP across different numbers of labeled samples.
  • ...and 7 more figures