
Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model

Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, Mingyuan Zhou

TL;DR

This paper proposes a segment-level reward model that assigns a reward to each semantically complete text segment spanning a short sequence of tokens. It also generalizes the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolates the segment rewards for further densification.

Abstract

Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preferences. Prior RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequential nature of LM generation and can suffer from the sparse reward issue. While recent works propose dense token-level RLHF, treating each token as an action may be too fine-grained for proper reward assignment. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model, which assigns a reward to each semantically complete text segment that spans a short sequence of tokens. For reward learning, our method allows dynamic text segmentation and is compatible with standard sequence-preference datasets. For effective RL-based LM training against the segment reward, we generalize the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolate the segment reward for further densification. With these designs, our method performs competitively on three popular RLHF benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation studies are conducted to further demonstrate our method.
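
To make the compatibility with sequence-preference data concrete, here is a minimal sketch of one way segment rewards could be trained from chosen/rejected pairs. It assumes Average aggregation (the choice named in the Figure 2 caption) and a Bradley-Terry style pairwise loss, which is a common choice for preference learning but is an assumption here rather than something the abstract states. The function names `sequence_score` and `preference_loss` are hypothetical.

```python
import math


def sequence_score(segment_rewards):
    """Aggregate per-segment rewards into one sequence-level score.

    Uses the Average aggregation; other choices of f(.) may suit
    other tasks or datasets.
    """
    if not segment_rewards:
        raise ValueError("need at least one segment reward")
    return sum(segment_rewards) / len(segment_rewards)


def preference_loss(chosen_segment_rewards, rejected_segment_rewards):
    """Pairwise loss on aggregated scores (Bradley-Terry form, assumed).

    Returns -log(sigmoid(margin)), where margin is the chosen score
    minus the rejected score, computed in a numerically stable way.
    """
    margin = sequence_score(chosen_segment_rewards) - sequence_score(
        rejected_segment_rewards
    )
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Because the loss only sees the aggregated scores, the two responses in a pair may be split into different numbers of segments, which is what lets a dynamic segmentation coexist with ordinary sequence-level preference labels.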

Paper Structure

This paper contains 18 sections, 10 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of training and utilizing our segment-level reward model. The numeric values in the plot are artificial. In the figure, each text segment has a different color, and its starting word is underlined.
  • Figure 2: Examples of data in our reward training dataset, motivating us to use Average as the aggregation function $f(\cdot)$. The average quality of the chosen responses is better than that of the rejected ones, as the chosen responses contain some key phrases (highlighted in red) for responding to the given prompt. We note that the choice of $f(\cdot)$ ought to be task- and dataset-specific, as discussed in Yang et al. (2023).
  • Figure 3: Examples of dense reward assignment for text sequences encountered in PPO training, comparing our segment-level reward model and the recent token-level design on normal text (Top) and text with verbosity/repetition (Bottom). Darker color indicates higher reward. In the bottom half, repeated sentences are underlined.
  • Figure 4: Performance comparison among different entropy cutoffs $c_{\mathrm{ent}}$ for entropy-based text segmentation, comparing PPO-trained LM policy's benchmark scores and average segment length ("Avg. Seg. Len") in terms of number of tokens.
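
The entropy cutoff $c_{\mathrm{ent}}$ in Figure 4 can be illustrated with a small sketch. It assumes that a new segment starts at any token whose predictive entropy exceeds the cutoff. The captions do not spell out the exact rule, so this is an illustrative reading, and `segment_by_entropy` and `average_segment_length` are hypothetical helper names.

```python
def segment_by_entropy(token_entropies, c_ent):
    """Split token positions into segments using an entropy cutoff.

    Assumption: a token whose entropy is above c_ent opens a new segment.
    Returns a list of (start, end) index pairs, with end exclusive.
    """
    segments = []
    start = 0
    for i, entropy in enumerate(token_entropies):
        # The first token always belongs to the first segment.
        if i > 0 and entropy > c_ent:
            segments.append((start, i))
            start = i
    if token_entropies:
        segments.append((start, len(token_entropies)))
    return segments


def average_segment_length(segments):
    """Mean number of tokens per segment (0.0 for an empty list)."""
    if not segments:
        return 0.0
    return sum(end - start for start, end in segments) / len(segments)
```

Under this reading, a lower cutoff yields more and shorter segments, approaching token-level rewards, while a very high cutoff yields a single segment, which recovers the bandit setting.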