Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu

TL;DR

A novel Self-supervised Preference Optimization (SPO) framework is proposed, which constructs a self-supervised preference degree loss combined with the alignment loss, thereby helping LLMs improve their ability to understand the degree of preference.

Abstract

Recently, there has been significant interest in methods that replace the reward model in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These approaches commonly apply a binary cross-entropy objective to pairwise samples, i.e., minimizing the loss on preferred responses and maximizing it on dis-preferred responses. However, while this training strategy omits the reward model, it also overlooks the varying degrees of preference across different responses. We hypothesize that this is a key factor hindering LLMs from sufficiently understanding human preferences. To address this problem, we propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss and combines it with the alignment loss, thereby helping LLMs improve their ability to understand the degree of preference. Extensive experiments are conducted on two widely used datasets covering different tasks. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods and significantly boosts their performance, achieving state-of-the-art results. We also conduct detailed analyses that offer comprehensive insights into SPO and verify its effectiveness. The code is available at https://github.com/lijian16/SPO.
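
To make the abstract's two ingredients concrete, the following is a minimal sketch, not the authors' implementation. It shows the standard DPO binary cross-entropy loss on one preference pair, plus a hypothetical `combined_loss` that adds a preference degree loss to the alignment loss. The additive form, the function names, and the default values of `beta` and `gamma` are assumptions made for illustration; the source only states that the two losses are combined and that a weight $\gamma$ is involved.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, dis-preferred) pair.

    Inputs are sequence log-probabilities under the policy and the
    frozen reference model; `w` is the preferred response, `l` the
    dis-preferred one.
    """
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def combined_loss(alignment_loss: float, degree_loss: float,
                  gamma: float = 0.5) -> float:
    """ASSUMPTION: a simple weighted sum of the alignment loss and the
    self-supervised preference degree loss, weighted by gamma."""
    return alignment_loss + gamma * degree_loss
```

Note that the pairwise loss depends only on which response is preferred, not on how strongly it is preferred. That is the gap the preference degree loss is meant to address.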

Paper Structure

This paper contains 31 sections, 14 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: The architecture of our proposed Self-supervised Preference Optimization (SPO) method involves employing an extractor to identify key content within the outputs of LLMs. Subsequently, self-supervised modules dedicated to preference and dis-preference randomly remove this content and undertake classification tasks. Ultimately, the loss derived from the classification is integrated with the alignment loss to jointly optimize the LLM.
  • Figure 2: Comparison of win rates with different state-of-the-art methods on TL;DR and Anthropic-HH datasets of three LLMs, i.e., LLaMA-7B, LLaMA-13B and Mistral-7B.
  • Figure 3: Analysis of the relationship between key content and preferences on the TL;DR and Anthropic-HH datasets.
  • Figure 4: The impact of self-supervised classification numbers on the performance. LLaMA-7B and 13B with DPO (+SPO) are trained on TL;DR dataset.
  • Figure 5: The impact of the weight $\gamma$ on the performance. LLaMA-7B and 13B with DPO (+SPO) are trained on the TL;DR dataset.
  • ...and 1 more figure
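
The Figure 1 caption describes randomly removing key content from a response and then solving a classification task. One possible reading of that step is sketched below; it is not the authors' implementation. The function `remove_key_content` is hypothetical, the key terms are assumed to come from an external extractor that is not shown, and using the number of removed terms as the class label is an assumption made for illustration.

```python
import random


def remove_key_content(response: str, key_terms: list[str],
                       num_classes: int,
                       rng: random.Random) -> tuple[str, int]:
    """Drop a random number of key terms from `response`.

    Returns the degraded text and a class label. ASSUMPTION: the label
    is the number of distinct key terms removed, in [0, num_classes - 1].
    Key terms are treated as single whitespace-delimited tokens.
    """
    unique_terms = list(dict.fromkeys(key_terms))  # de-duplicate, keep order
    max_removed = min(num_classes - 1, len(unique_terms))
    label = rng.randint(0, max_removed)            # inclusive on both ends
    removed = set(rng.sample(unique_terms, label))
    kept = [tok for tok in response.split() if tok not in removed]
    return " ".join(kept), label


# Usage: a classifier head would be trained to predict `label` from the
# degraded response, and its loss added to the alignment loss.
rng = random.Random(0)
degraded, label = remove_key_content(
    "the cat sat on the mat", ["cat", "mat"], num_classes=3, rng=rng
)
```

Under this reading, the more key content is removed, the further the response moves from its original preference level, so the label acts as a self-supervised proxy for preference degree.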