Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback

Jiayi Zhou, Jiaming Ji, Juntao Dai, Dong Li, Yaodong Yang

TL;DR

The work addresses biased optimization in RLHF caused by scalar reward modeling and proposes a novel sequence-to-sequence reward model (seq2seq RM) that learns from language feedback rather than scalar signals. By training Correction Mapping and Identity Mapping with sequence MLE, and by extracting token-level positive and negative feedback from the point where sequences diverge, seq2seq RM provides finer-grained credit assignment and stronger alignment signals. Empirical results show reduced long-response bias and refusal-to-response behavior, with improved alignment across 2B and 7B models on three NLP tasks and robust performance under out-of-distribution prompts, achieving an average win rate of 76.9%. The method does not require extra annotations or new models, and it enhances both the accuracy and granularity of reward signals, contributing to safer and more reliable RLHF deployments.

Abstract

Aligning the behavior of large language models (LLMs) with human intentions and values remains a critical challenge. Reinforcement learning from human feedback (RLHF) aligns LLMs by training a reward model (RM) on human preferences and fine-tuning the LLMs to maximize RM feedback. Despite its effectiveness and popularity, RLHF is prone to biased local optimization: the RM fails to provide feedback that accurately aligns with human preferences, causing LLMs to explore unexpected generalizations and fail to achieve alignment objectives. To mitigate this issue, we propose a novel sequence-to-sequence (seq2seq) reward modeling method. Its key insight is that learning from language feedback rather than scalar feedback improves RLHF without additional annotations. We replace the reward modeling target of binary maximum likelihood estimation (MLE) with sequence MLE. This method enables richer and more fine-grained language feedback without additional annotations, models, or training stages. Our experiments demonstrate its effectiveness, specifically in reducing the refusal-to-response paradigm in single-turn safety dialogues and the long-response bias in text summarization tasks. Further analysis shows that seq2seq RM improves RLHF performance across 2B and 7B LLMs on 3 NLP tasks, achieving an average win rate of 76.9%. We further show that seq2seq RM can still improve the performance of RLHF under out-of-distribution prompts.
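
For reference, the change of training target described in the abstract can be written out roughly as follows. This is a sketch in assumed notation, not the paper's own equations: $\bm{x}$ is the prompt, $\bm{y}_w$ and $\bm{y}_l$ are the chosen and rejected responses, $\mathcal{D}$ is the preference dataset, $r_\phi$ is a scalar RM, and $p_\phi$ is the seq2seq RM. The first line is the standard Bradley-Terry (binary MLE) objective. The second line sums the log-likelihoods of the Correction Mapping and Identity Mapping named in the Figure 2 caption; weighting them equally is an assumption.

```latex
\begin{align}
\mathcal{L}_{\text{scalar}}(\phi)
  &= -\mathbb{E}_{(\bm{x},\bm{y}_w,\bm{y}_l)\sim\mathcal{D}}
     \left[\log\sigma\big(r_\phi(\bm{x},\bm{y}_w) - r_\phi(\bm{x},\bm{y}_l)\big)\right] \\
\mathcal{L}_{\text{seq}}(\phi)
  &= -\mathbb{E}_{(\bm{x},\bm{y}_w,\bm{y}_l)\sim\mathcal{D}}
     \left[\log p_\phi(\bm{y}_w \mid \bm{x},\bm{y}_l)
         + \log p_\phi(\bm{y}_w \mid \bm{x},\bm{y}_w)\right]
\end{align}
```

Both objectives consume the same preference triples $(\bm{x},\bm{y}_w,\bm{y}_l)$, which is consistent with the claim that no additional annotations are needed.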

Paper Structure (32 sections, 7 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: The traditional sequence-to-scalar reward model provides only coarse-grained scalar feedback on the LLM's response, which leaves the LLM prone to exploiting unexpected generalizations for high rewards during the RL fine-tuning phase [skalse2023misspecification], e.g., falling into a refusal-to-answer paradigm [bai2022training, dai2023safe]. Without additional annotations, we propose a novel sequence-to-sequence reward modeling method that offers richer language feedback, improving RLHF performance.
  • Figure 2: Overview of the seq2seq reward modeling pipeline. Our pipeline consists of two stages: (1) Reward Modeling: We train the seq2seq RM by sequence maximum likelihood estimation (MLE) to output the chosen response when the rejected response is input (Correction Mapping) and to output the chosen response when the chosen response is input (Identity Mapping). (2) Reward Extracting: We reward the response with a positive score until it diverges from the seq2seq RM output; after that point, we input the response token by token and assign negative scores to the diverging tokens.
  • Figure 3: An example of extracting positive and negative scores from the seq2seq RM. $\bm{x}$ denotes the prompt, $\bm{y}$ is the LLM-generated response, <BOS> is the beginning-of-string token, and $r^{\text{seq}}(\cdot)$ is the inference function of the seq2seq RM.
  • Figure 4: Distribution shifts of utility and safety scores under safety alignment. RLHF based on the seq2scalar RM (upper) causes LLMs to misgeneralize to the refusal-to-response paradigm. Our seq2seq RM (lower) mitigates this unexpected generalization, improving safety while maintaining the utility score distribution.
  • Figure 5: Comparison of PPO-T, PPO-T-Pos and Init-SFT. Bars above the dashed line are based on Gemma-2B while the others are based on Llama2-7B.
  • ...and 1 more figures
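
The two stages in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_training_pairs`, `extract_token_rewards`, and `toy_rm_next_token` are hypothetical names, tokens are plain strings, and the score magnitudes of +1 and -1 are arbitrary. Giving a score of 0 to post-divergence tokens that the RM would also have produced is one possible reading of the caption.

```python
from typing import Callable, List, Sequence, Tuple

Token = str


def build_training_pairs(
    chosen: Sequence[Token], rejected: Sequence[Token]
) -> List[Tuple[Sequence[Token], Sequence[Token]]]:
    """Stage 1 data: (input, target) pairs for sequence MLE.

    Correction Mapping: rejected -> chosen.
    Identity Mapping:   chosen   -> chosen.
    """
    return [(rejected, chosen), (chosen, chosen)]


def extract_token_rewards(
    response: Sequence[Token],
    rm_output: Sequence[Token],
    rm_next_token: Callable[[Sequence[Token]], Token],
    pos: float = 1.0,  # assumed magnitude, not taken from the paper
    neg: float = -1.0,  # assumed magnitude, not taken from the paper
) -> List[float]:
    """Stage 2: turn seq2seq RM behavior into per-token scores."""
    rewards = [0.0] * len(response)

    # Positive score for every token before the first divergence
    # between the response and the RM output.
    t = 0
    while t < len(response) and t < len(rm_output) and response[t] == rm_output[t]:
        rewards[t] = pos
        t += 1

    # After the divergence, feed the response token by token and
    # penalize tokens the RM would not have produced at that step.
    for i in range(t, len(response)):
        if rm_next_token(response[:i]) != response[i]:
            rewards[i] = neg
    return rewards


# Toy stand-in for a trained seq2seq RM (hypothetical): it ignores the
# prefix content and always continues a fixed reference sequence.
REFERENCE = ["I", "can", "help", "with", "that", "."]


def toy_rm_next_token(prefix: Sequence[Token]) -> Token:
    i = len(prefix)
    return REFERENCE[i] if i < len(REFERENCE) else "<EOS>"
```

With the toy RM, the response `["I", "can", "not", "help", "that", "."]` matches the reference for two tokens, diverges at `"not"`, and is scored `[1.0, 1.0, -1.0, -1.0, 0.0, 0.0]`. A real RM would condition on the prompt and the prefix content rather than on position alone.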