LIRE: listwise reward enhancement for preference alignment

Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo, Zhendong Mao

TL;DR

LIRE introduces a gradient-based, offline listwise objective for preference alignment that leverages multiple responses per query and an offline reward signal. By formulating a temperature-smoothed distribution over all responses and optimizing the expected reward under this distribution, LIRE distributes learning signals across high- and low-reward candidates, improving stability and efficiency. A self-enhancement loop (Evolve/Iterate) further refines rewards through iterative data generation and policy updates. LIRE performs strongly on dialogue and summarization benchmarks and transfers well to out-of-distribution data. It outperforms pairwise baselines, stays close to a reference policy, and offers practical benefits: no online sampling during training and a straightforward implementation. By exploiting listwise information, LIRE remains robust across multiple evaluation modalities and reward models, with implications for scalable, safer LLM deployment.

Abstract

Recently, tremendous strides have been made to align the generation of Large Language Models (LLMs) with human values to mitigate toxic or unhelpful content. Leveraging Reinforcement Learning from Human Feedback (RLHF) proves effective and is widely adopted by researchers. However, implementing RLHF is complex, and its sensitivity to hyperparameters renders achieving stable performance and scalability challenging. Furthermore, prevailing approaches to preference alignment primarily concentrate on pairwise comparisons, with limited exploration into multi-response scenarios, thereby overlooking the potential richness within the candidate pool. For the above reasons, we propose a new approach: Listwise Reward Enhancement for Preference Alignment (LIRE), a gradient-based reward optimization approach that incorporates the offline rewards of multiple responses into a streamlined listwise framework, thus eliminating the need for online sampling during training. LIRE is straightforward to implement, requiring minimal parameter tuning, and seamlessly aligns with the pairwise paradigm while naturally extending to multi-response scenarios. Moreover, we introduce a self-enhancement algorithm aimed at iteratively refining the reward during training. Our experiments demonstrate that LIRE consistently outperforms existing methods across several benchmarks on dialogue and summarization tasks, with good transferability to out-of-distribution data, assessed using proxy reward models and human annotators.
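
The listwise objective summarized above (a temperature-smoothed distribution over all responses to a query, with the expected reward under that distribution as the training signal) can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the authors' implementation: the function names, the use of summed sequence log-probabilities as scores, and the exact placement of the temperature are all hypothetical choices made here.

```python
import numpy as np

def listwise_probs(seq_logprobs, temperature=1.0):
    """Temperature-smoothed softmax over the candidate responses of one query.

    seq_logprobs: one policy log-probability per candidate response
    (assumed here to be the summed token log-probs of each response).
    """
    scaled = np.asarray(seq_logprobs, dtype=float) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

def listwise_loss(seq_logprobs, rewards, temperature=1.0):
    """Negative expected reward under the smoothed distribution.

    rewards: offline reward scores, one per candidate response.
    Minimizing this value shifts probability mass toward high-reward
    candidates and away from low-reward ones.
    """
    probs = listwise_probs(seq_logprobs, temperature)
    return -float(np.dot(probs, np.asarray(rewards, dtype=float)))

# Toy example: three candidate responses for a single query.
logprobs = [-12.0, -15.0, -11.0]   # hypothetical policy log-probs
rewards = [0.2, 0.9, -0.5]         # hypothetical offline reward scores
loss = listwise_loss(logprobs, rewards, temperature=2.0)
```

Because every candidate contributes to the weighted sum, each response in the list receives a learning signal, which is the property the listwise formulation is meant to exploit. With only two candidates the same function reduces to a pairwise setting.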

Paper Structure

This paper contains 27 sections, 15 equations, 8 figures, 9 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Training pipeline of the proposed LIRE framework. The candidate pool is initially constructed by gathering responses $A$ from an arbitrary policy $\pi_{\mathbf{\theta}_{init}}$. The scored responses, together with their query, are then optimized in a listwise manner. The dashed line marks an optional step: re-initializing the updated model $\pi_\mathbf{\theta}$ as the sampling policy and generating fresh responses that replace the prior ones in the candidate pool.
  • Figure 2: Summarization win rate against human-written baselines. On a randomly selected subset of the test split, LIRE and PPO achieve comparable GPT-4 support rates, followed by DPO and PRO.
  • Figure 3: Radar plot of MT-Bench scores with GPT-4 as the judge. The plot shows the score distribution across categories for each method. The numbers beside the names are the summed scores. LIRE and PPO maintain relatively more comprehensive performance, indicating their generalization ability when transferred to out-of-distribution data.
  • Figure 4: Win rate evolution as the sequence number increases. As the sequence number increases, both LIRE and Best-of-$n$ improve in win rate as calculated by RM. When evaluated with RM$^{*}$, Best-of-$n$ shows a more significant performance decline, suggesting that Best-of-$n$ gives results that largely align with the preference of RM but may not match the preference of a different reward model RM$^{*}$ to the same extent.
  • Figure 5: Reward-KL frontiers of different algorithms. The plot illustrates that LIRE provides good rewards while maintaining relatively small KL.
  • ...and 3 more figures
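
The Best-of-$n$ baseline mentioned in the Figure 4 caption can be sketched as follows: sample several candidates, score each with a reward model, and keep the top one. The two scoring functions below are hypothetical toy stand-ins for RM and RM$^{*}$, invented purely for illustration; they are not the reward models used in the paper.

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate that reward_fn scores highest."""
    return max(candidates, key=reward_fn)

# Hypothetical toy scorers standing in for RM and RM*.
def rm(text):
    return len(text)              # toy preference: longer responses

def rm_star(text):
    return -abs(len(text) - 20)   # toy preference: about 20 characters

candidates = [
    "Sure.",
    "Here is a short answer.",
    "Here is a much longer and more detailed answer.",
]

picked_by_rm = best_of_n(candidates, rm)
picked_by_rm_star = best_of_n(candidates, rm_star)
```

In this toy setup the two scorers select different candidates, which mirrors the caption's point: a response chosen to maximize one reward model is not guaranteed to rank highly under another.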