Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou

TL;DR

This work addresses the challenge of aligning LLMs to diverse human preferences beyond identical-prompt contrasts. It introduces Relative Preference Optimization (RPO), a contrastive, cross-prompt learning framework that leverages a contrast matrix and embedding-based weighting to exploit both paired and unpaired preference data. Through extensive experiments on dialogue and summarization tasks across multiple models, RPO demonstrates superior or competitive alignment performance versus DPO, IPO, KTO, and RLHF baselines, with notable gains when using semantically related prompts. The approach offers a scalable, adaptable path toward more nuanced user-aligned AI, while highlighting limitations related to embedding quality, memory constraints, and the modeling of the partition function Z(x).

Abstract

In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences derived from the same prompts, and it functions without needing an additional reward model. However, DPO does not fully reflect the complex nature of human learning, which often involves understanding contrasting responses to not only identical but also similar questions. To overcome this shortfall, we propose Relative Preference Optimization (RPO). RPO is designed to discern between more and less preferred responses derived from both identical and related prompts. It introduces a contrastive weighting mechanism, enabling the tuning of LLMs using a broader range of preference data, including both paired and unpaired sets. This approach expands the learning capabilities of the model, allowing it to leverage insights from a more varied set of prompts. Through empirical tests, including dialogue and summarization tasks, and evaluations using the AlpacaEval2.0 leaderboard, RPO has demonstrated a superior ability to align LLMs with user preferences and to improve their adaptability during the training process. Our code can be viewed at https://github.com/yinyueqin/relative-preference-optimization
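
For readers less familiar with DPO, the following sketch recalls why contrasting responses to the same prompt needs no separate reward model. It uses the standard DPO notation, which this abstract does not restate and is therefore an assumption here: $\pi_\theta$ is the policy being tuned, $\pi_{\mathrm{ref}}$ a frozen reference model, $\beta$ a temperature, $\sigma$ the logistic function, and $Z(x)$ a prompt-dependent partition function. The last line is one plausible reading of why $Z(x)$ matters once the two responses come from different prompts; the paper's own treatment may differ.

```latex
\begin{align}
% Implicit reward expressed through the policy (standard DPO form)
r(x, y) &= \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x), \\
% Same prompt x for both responses: the Z(x) terms cancel in the difference
\mathcal{L}_{\mathrm{DPO}}
  &= -\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\big( r(x, y_w) - r(x, y_l) \big) \right] \nonumber \\
  &= -\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\!\left(
       \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
     - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right], \\
% Different prompts x_i and x_j: a residual term in Z remains
r(x_i, y_{w,i}) - r(x_j, y_{l,j})
  &= \beta \log \frac{\pi_\theta(y_{w,i} \mid x_i)}{\pi_{\mathrm{ref}}(y_{w,i} \mid x_i)}
   - \beta \log \frac{\pi_\theta(y_{l,j} \mid x_j)}{\pi_{\mathrm{ref}}(y_{l,j} \mid x_j)}
   + \beta \log \frac{Z(x_i)}{Z(x_j)}.
\end{align}
```

With identical prompts the loss depends only on log-probability ratios that the policy and reference model can compute directly. With different prompts the residual $\beta \log \big( Z(x_i) / Z(x_j) \big)$ does not cancel, so any cross-prompt contrast has to either ignore or approximate it.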

Paper Structure (34 sections, 21 equations, 2 figures, 11 tables, 1 algorithm)

Figures (2)

  • Figure 1: An example illustrating how DPO and RPO use contrastive responses with human preferences to achieve model alignment.
  • Figure 2: DPO requires paired preference data derived from identical prompts. RPO can utilize preference data from either the same or different prompts for constructing contrastive samples. Here, $y_w$ represents win responses, and $y_l$ denotes lose responses.
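
The cross-prompt contrast described in Figure 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `contrast_matrix_loss`, the parameters `beta` and `tau`, the cosine-distance weighting, and the row-wise normalization are all hypothetical choices made here to show the idea of a contrast matrix with embedding-based weights. The inputs are assumed to be precomputed log-probability ratios $\log \pi_\theta / \pi_{\mathrm{ref}}$ for each response and one embedding per prompt.

```python
import numpy as np


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -np.logaddexp(0.0, -z)


def contrast_matrix_loss(win_logratio, lose_logratio, win_emb, lose_emb,
                         beta=0.1, tau=0.5):
    """Hypothetical sketch of a weighted cross-prompt contrastive loss.

    win_logratio:  (M,) log pi_theta/pi_ref for each win response y_w
    lose_logratio: (N,) log pi_theta/pi_ref for each lose response y_l
    win_emb:       (M, d) embeddings of the prompts behind the win responses
    lose_emb:      (N, d) embeddings of the prompts behind the lose responses
    """
    win_logratio = np.asarray(win_logratio, dtype=float)
    lose_logratio = np.asarray(lose_logratio, dtype=float)
    win_emb = np.asarray(win_emb, dtype=float)
    lose_emb = np.asarray(lose_emb, dtype=float)

    # Contrast matrix: entry (i, j) compares win response i with lose response j,
    # whether or not they share a prompt. DPO would use only matched pairs.
    scores = beta * (win_logratio[:, None] - lose_logratio[None, :])

    # Cosine distance between the prompts of each (win, lose) pair.
    win_unit = win_emb / np.linalg.norm(win_emb, axis=1, keepdims=True)
    lose_unit = lose_emb / np.linalg.norm(lose_emb, axis=1, keepdims=True)
    cos_dist = 1.0 - win_unit @ lose_unit.T

    # Assumption: closer prompts get larger weights; each row sums to 1.
    raw = np.exp(-cos_dist / tau)
    weights = raw / raw.sum(axis=1, keepdims=True)

    # Weighted negative log-likelihood, averaged over the M win responses.
    loss = -np.sum(weights * log_sigmoid(scores)) / scores.shape[0]
    return loss, scores, weights


# Toy batch: two prompts with orthogonal embeddings, one win and one lose each.
prompt_emb = [[1.0, 0.0], [0.0, 1.0]]
loss, scores, weights = contrast_matrix_loss(
    win_logratio=[0.8, 0.2], lose_logratio=[-0.5, 0.1],
    win_emb=prompt_emb, lose_emb=prompt_emb,
)
print(scores.shape, weights.round(3), float(loss))
```

In the toy batch, the diagonal entries correspond to same-prompt pairs and receive the largest weights, while the off-diagonal entries are the cross-prompt contrasts that DPO cannot use. Lowering `tau` concentrates the weight on semantically closer prompts; raising it moves toward uniform weighting over all pairs.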