TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback

Eunseop Yoon, Hee Suk Yoon, SooHwan Eom, Gunsoo Han, Daniel Wontae Nam, Daejin Jo, Kyoung-Woon On, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo

TL;DR

This work presents TLCR, a token-level continuous reward framework for RLHF that replaces sparse sequence rewards with context-aware per-token signals. A token-level discriminator is trained from token-level preferences derived by an external reviser using Levenshtein edits, yielding $D_{\phi}$ confidence scores that map to per-token rewards via $r_t = 2 \cdot D_{\phi}(a_t|x, a_{0:t-1}) - 1$ for PPO updates. Across MT-Bench, AlpacaEval, and human evaluation on full-hh-rlhf data, TLCR consistently outperforms sequence-level PPO, token-level discrete reward methods, and fixed-reward baselines, demonstrating finer-grained alignment with human preferences. The results indicate that continuous token-level rewards enable more precise guidance for language model alignment in open-ended generation tasks, with potential broader impact on safer and more reliable AI systems.
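The mapping $r_t = 2 \cdot D_{\phi}(a_t|x, a_{0:t-1}) - 1$ can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `token_rewards` and the example probabilities are hypothetical, and the discriminator itself is assumed to have already produced, for each token, a probability in $[0, 1]$ of that token being positive.

```python
def token_rewards(positive_probs):
    """Rescale discriminator confidences from [0, 1] to rewards in [-1, 1].

    positive_probs: per-token probabilities of being a "positive" token,
    assumed to come from a trained discriminator (not implemented here).
    """
    rewards = []
    for p in positive_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        # r_t = 2 * D(a_t | x, a_{0:t-1}) - 1
        rewards.append(2.0 * p - 1.0)
    return rewards


# Hypothetical confidences for a four-token response.
example_probs = [1.0, 0.75, 0.5, 0.0]
example_rewards = token_rewards(example_probs)
print(example_rewards)
```

Under this rescaling, a confidence of 0.5 yields a reward of 0 (neutral), while confidences near 1 or 0 yield rewards near +1 or -1 respectively.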

Abstract

Reinforcement Learning from Human Feedback (RLHF) leverages human preference data to train language models to align more closely with human essence. These human preference data, however, are labeled at the sequence level, creating a mismatch between sequence-level preference labels and tokens, which are autoregressively generated from the language model. Although several recent approaches have tried to provide token-level (i.e., dense) rewards for each individual token, these typically rely on predefined discrete reward values (e.g., positive: +1, negative: -1, neutral: 0), failing to account for varying degrees of preference inherent to each token. To address this limitation, we introduce TLCR (Token-Level Continuous Reward) for RLHF, which incorporates a discriminator trained to distinguish positive and negative tokens, and the confidence of the discriminator is used to assign continuous rewards to each token considering the context. Extensive experiments show that our proposed TLCR leads to consistent performance improvements over previous sequence-level or token-level discrete rewards on open-ended generation benchmarks.

Paper Structure (34 sections, 7 equations, 9 figures, 2 tables)


Figures (9)

  • Figure 1: Illustration of different granularity of rewards in RLHF. (a) Sequence-Level Reward provides a singular preference value for the entire sequence. (b) Token-Level Discrete Reward allocates fixed discrete reward values for each token. (c) Our proposed Token-Level Continuous Reward assigns each token a continuous range of rewards.
  • Figure 2: Illustration of the training procedure of the discriminator used in TLCR (Token-Level Continuous Reward). (a) Starting from the sequence-level labeled dataset, we use an external mature language model $\text{LLM}_\text{ext}$ as a reviser to obtain token-level preference labels. $\text{LLM}_\text{ext}$ is instructed to compare the chosen response ($y_c$) with the rejected response ($y_r$), reason about why the chosen one is preferred, and create a modified response $y_m$ by minimally editing the rejected response. Using the Levenshtein distance between $y_r$ and $y_m$, we assign token-wise preference labels based on whether tokens have been added, deleted, or substituted. (b) With the token-wise preference labels from the previous step, we train a discriminator to distinguish positive, neutral, and negative tokens.
  • Figure 3: Illustration of using the discriminator for assigning token-level continuous reward during PPO. The discriminator's prediction probability of a token being positive undergoes normalization to fit a scale from -1 to 1. A value near -1 signifies an unfavorable preference, near 1 suggests a favorable preference, and around 0 denotes a neutral preference.
  • Figure 4: Evaluation on test questions from the AlpacaEval dataset.
  • Figure 5: Human evaluation results on 100 random samples from the full-hh-rlhf test set. Five annotators were tasked with selecting the most preferred response among those generated by the different methods. We report the average proportion of preferences assigned to each method's outputs.
  • ...and 4 more figures
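
The token-labeling step described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names `edit_ops` and `label_tokens` are hypothetical, tokens are plain whitespace-split words, and the label assignment (deleted or substituted-out tokens of $y_r$ as negative, added or substituted-in tokens of $y_m$ as positive, unchanged tokens as neutral) is an assumption consistent with the caption rather than a rule stated in it.

```python
def edit_ops(src, tgt):
    """Align two token lists with Levenshtein DP and return the edit script."""
    n, m = len(src), len(tgt)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # delete
                             dist[i][j - 1] + 1,        # insert
                             dist[i - 1][j - 1] + cost)  # keep / substitute
    # Backtrace from the bottom-right corner to recover one minimal script.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and src[i - 1] == tgt[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            ops.append(("keep" if same else "substitute", src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", src[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, tgt[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def label_tokens(rejected, modified):
    """Assumed labeling: -1 negative, 0 neutral, +1 positive."""
    rejected_labels, modified_labels = [], []
    for op, r_tok, m_tok in edit_ops(rejected, modified):
        if op == "keep":
            rejected_labels.append((r_tok, 0))
            modified_labels.append((m_tok, 0))
        elif op == "substitute":
            rejected_labels.append((r_tok, -1))
            modified_labels.append((m_tok, 1))
        elif op == "delete":
            rejected_labels.append((r_tok, -1))
        else:  # insert
            modified_labels.append((m_tok, 1))
    return rejected_labels, modified_labels


# Hypothetical rejected response y_r and minimally edited response y_m.
y_r = "you should ignore the doctor".split()
y_m = "you should ask the doctor".split()
r_labels, m_labels = label_tokens(y_r, y_m)
print(r_labels)
print(m_labels)
```

In this toy pair only one word differs, so "ignore" is marked negative in $y_r$, "ask" is marked positive in $y_m$, and every other token is neutral. A real pipeline would operate on model tokenizer output rather than whitespace-split words.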