Reward Difference Optimization For Sample Reweighting In Offline RLHF

Shiqi Wang, Zhengze Zhang, Rui Zhao, Fei Tan, Cam Tu Nguyen

TL;DR

This work proposes Reward Difference Optimization (RDO), a simple yet effective method that introduces reward difference coefficients to reweight sample pairs in offline RLHF, and develops a difference model that captures rich interactions between a pair of responses to predict these coefficients.

Abstract

With the rapid advances in Large Language Models (LLMs), aligning LLMs with human preferences has become increasingly important. Although Reinforcement Learning with Human Feedback (RLHF) proves effective, it is complicated and highly resource-intensive. As such, offline RLHF has been introduced as an alternative solution, which directly optimizes LLMs with ranking losses on a fixed preference dataset. Current offline RLHF only captures the "ordinal relationship" between responses, overlooking the crucial aspect of how much one is preferred over the others. To address this issue, we propose a simple yet effective solution called Reward Difference Optimization (RDO). Specifically, we introduce reward difference coefficients to reweight sample pairs in offline RLHF. We then develop a difference model that captures rich interactions between a pair of responses to predict these difference coefficients. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation, thereby highlighting its potential for aligning LLMs with human intent and values.

Paper Structure

This paper contains 14 sections, 9 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Offline RLHF methods only care about the binary relation between responses (i.e., which response is better). However, given the query, some responses may be of similar quality (Responses 1 and 2 in the figure) while others may be clearly worse (Response 3 in the figure). The number in the upper-right corner is the score given by the reward model.
  • Figure 2: The pipeline of traditional offline alignment methods (upper part) and our proposed Reward Difference Optimization (RDO) pipeline with more accurate supervision signals (lower part). Special tokens are omitted from the figure to save space. Instead of using the reward model to identify the ordinal relation between two responses (i.e., win or lose), we propose to use a difference model to predict the difference score between two responses directly, and then use this score to supervise the alignment process more precisely.
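
The reweighting idea described in the abstract and in Figure 2 can be illustrated with a small sketch. It is not the paper's exact loss, which this page does not reproduce. It assumes a generic pairwise log-sigmoid ranking loss and assumes the difference coefficient simply multiplies each pair's loss. The names `pairwise_ranking_loss`, `reweighted_loss`, `margins`, and `diff_scores` are hypothetical and chosen only for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_ranking_loss(margin: float) -> float:
    """Generic pairwise ranking loss: -log sigmoid(margin).

    `margin` is assumed to be the policy's score gap between the
    preferred and the dispreferred response (an illustrative choice).
    """
    return -log_sigmoid(margin)


def reweighted_loss(margins, diff_scores):
    """Mean pairwise loss, with each pair scaled by its difference coefficient.

    Assumption: the coefficient is a non-negative score saying how much
    better the preferred response is, and it multiplies the pair's loss.
    """
    if len(margins) != len(diff_scores):
        raise ValueError("margins and diff_scores must have the same length")
    if not margins:
        raise ValueError("at least one sample pair is required")
    total = 0.0
    for margin, coeff in zip(margins, diff_scores):
        total += coeff * pairwise_ranking_loss(margin)
    return total / len(margins)


if __name__ == "__main__":
    # Two pairs with the same policy margin but different quality gaps.
    margins = [0.2, 0.2]
    uniform = reweighted_loss(margins, [1.0, 1.0])   # ordinal-only baseline
    weighted = reweighted_loss(margins, [0.1, 1.9])  # near-tie vs. clear win
    print(f"uniform={uniform:.4f} weighted={weighted:.4f}")
```

With uniform coefficients this reduces to an ordinary ranking loss. With non-uniform coefficients, a pair whose responses are nearly tied contributes less to the total than a pair with a clear quality gap, which is the behavior the abstract motivates.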