
Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation

Meng Cao, Lei Shu, Lei Yu, Yun Zhu, Nevan Wichers, Yinxiao Liu, Lei Meng

TL;DR

This work introduces RELC, a reinforcement learning framework for text generation that uses a critic language model to emit dense, intermediate intrinsic rewards at token or span granularity, addressing reward sparsity in environment signals. By coupling a policy LM with a frozen critic LM and training with PPO, RELC integrates intrinsic and extrinsic rewards to guide generation across sentiment control, detoxification, and summarization. Across three tasks, RELC demonstrates improved sample efficiency and superior or competitive performance against strong baselines, as validated by automatic metrics and human evaluations. The approach offers a practical path to more efficient and controllable language model RL, while acknowledging limitations related to critic size and potential API-related delays.

Abstract

Reinforcement learning (RL) can align language models with non-differentiable reward signals, such as human preferences. However, a major challenge arises from the sparsity of these reward signals: typically, there is only a single reward for an entire output. This sparsity of rewards can lead to inefficient and unstable learning. To address this challenge, our paper introduces a novel framework that utilizes the critique capability of Large Language Models (LLMs) to produce intermediate-step rewards during RL training. Our method involves coupling a policy model with a critic language model, which is responsible for providing comprehensive feedback on each part of the output. This feedback is then translated into token- or span-level rewards that can be used to guide the RL training process. We investigate this approach under two different settings: one where the policy model is smaller and is paired with a more powerful critic model, and another where a single language model fulfills both roles. We assess our approach on three text generation tasks: sentiment control, language model detoxification, and summarization. Experimental results show that incorporating artificial intrinsic rewards significantly improves both sample efficiency and the overall performance of the policy model, supported by both automatic and human evaluation.

Paper Structure (38 sections, 3 equations, 14 figures, 9 tables)

This paper contains 38 sections, 3 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 1: Illustration of the proposed framework. There are two modules inside the agent. The critic LM takes the state and reward as input and generates dense intrinsic reward signals that evaluate different parts of the generation. The policy module is trained to optimize the weighted sum of intrinsic and extrinsic rewards.
  • Figure 2: An example demonstrating the reward calculation process in the sentiment control task. In this example, the external environment returns a scalar reward of -2 in response to the policy model's output. Subsequently, the critic model is prompted to identify spans of positive and negative sentiment within the output. Tokens within these spans are then assigned intrinsic rewards: +1 for positive and -1 for negative sentiment. The hyper-parameter $\alpha$ determines the weight of these two types of rewards. The extrinsic reward is assigned to the last position in the output sequence.
  • Figure 3: GPT-2 large as policy LM and GPT-3.5 as critic
  • Figure 4: Self-critique using Llama 2 7B
  • Figure 6: GPT-2 large as policy LM and GPT-3.5 as critic
  • ...and 9 more figures
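
The reward calculation described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function name `combine_rewards`, the half-open `(start, end)` token-index span format, and the specific weighting (intrinsic rewards scaled by `alpha`, extrinsic reward added unscaled at the last position) are all assumptions made here; the caption only states that $\alpha$ determines the weight of the two reward types.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # hypothetical format: half-open token range [start, end)


def combine_rewards(
    num_tokens: int,
    positive_spans: List[Span],
    negative_spans: List[Span],
    extrinsic_reward: float,
    alpha: float,
) -> List[float]:
    """Build a per-token reward vector from critic spans and one scalar reward.

    Assumed weighting (not confirmed by the source): alpha scales the
    intrinsic term, and the extrinsic reward is added as-is.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")

    rewards = [0.0] * num_tokens

    # Tokens inside spans the critic marked as positive receive +1 (times alpha).
    for start, end in positive_spans:
        for i in range(start, end):
            rewards[i] += alpha

    # Tokens inside spans the critic marked as negative receive -1 (times alpha).
    for start, end in negative_spans:
        for i in range(start, end):
            rewards[i] -= alpha

    # The environment's scalar reward is assigned to the last position only.
    rewards[-1] += extrinsic_reward
    return rewards


# Six output tokens, one positive span, one negative span, extrinsic reward -2.
example = combine_rewards(
    num_tokens=6,
    positive_spans=[(0, 2)],
    negative_spans=[(3, 5)],
    extrinsic_reward=-2.0,
    alpha=0.5,
)
print(example)
```

With only the extrinsic term, every position except the last would receive zero reward; the span-level terms are what make the signal dense.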