
Prompt-Based Length Controlled Generation with Reinforcement Learning

Renlong Jie, Xiaojun Meng, Lifeng Shang, Xin Jiang, Qun Liu

TL;DR

The paper addresses the challenge of enforcing precise output length in autoregressive LLMs, which is important for real-world applications and can reduce inference cost. It introduces a prompt-based length-control framework that converts diverse user prompts into Standard Control Prompts (SCPs) via a Standard Prompt Extractor (SPE) and then trains LLMs with reinforcement learning using either a rule-based or a model-based reward, leveraging Proximal Policy Optimization (PPO). The method includes a novel sampling-filter mechanism that ranks generated candidates by a reward model during inference, and demonstrates substantial reductions in length-control error on CNNDM and NYT, with strong generalization to unseen prompt templates. The approach offers practical improvements for user-facing length control and can be extended to other controllable formats beyond length.

Abstract

Large language models (LLMs) like ChatGPT and GPT-4 have attracted great attention given their surprising performance on a wide range of NLP tasks. Length-controlled generation with LLMs has emerged as an important topic, since it enables users to fully leverage the capability of LLMs in more real-world scenarios, such as generating an answer or essay of a desired length. In addition, autoregressive generation in LLMs is extremely time-consuming, and the ability to control the generated length can reduce the inference cost by limiting the length. Therefore, we propose a prompt-based length control method to achieve high-accuracy length-controlled generation. In particular, we adopt reinforcement learning with the reward signal given by either trainable or rule-based reward models, which further enhances the length-control ability of LLMs by rewarding outputs that follow the pre-defined control instruction. To enable rule-based inference, we also introduce a standard prompt extractor to collect the standard control information from users' input. Experiments show that our method significantly improves the accuracy of prompt-based length control for the summarization task on popular datasets like CNNDM and NYT. Both the standard prompt extractor and the RL-tuned model have shown strong generalization ability to unseen control prompt templates.
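
As a rough illustration of the two ideas named in the abstract (a rule-based reward for length control, and ranking sampled outputs by that reward at inference time), the following is a minimal sketch rather than the authors' implementation. The `LengthConstraint` container, the distance-based penalty, and the whitespace token counter are all assumptions introduced here for illustration; the paper's actual reward definitions and tokenizer may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LengthConstraint:
    """Hypothetical container for a parsed length-control instruction."""
    min_len: Optional[int] = None  # inclusive lower bound; None means unbounded
    max_len: Optional[int] = None  # inclusive upper bound; None means unbounded


def rule_based_reward(length: int, constraint: LengthConstraint) -> float:
    """Assumed reward: 0.0 if the length satisfies the constraint,
    otherwise minus the distance to the nearest allowed length."""
    if constraint.min_len is not None and length < constraint.min_len:
        return -float(constraint.min_len - length)
    if constraint.max_len is not None and length > constraint.max_len:
        return -float(length - constraint.max_len)
    return 0.0


def whitespace_length(text: str) -> int:
    """Assumed length measure: number of whitespace-separated tokens."""
    return len(text.split())


def sample_filter(
    candidates: List[str],
    constraint: LengthConstraint,
    count_tokens: Callable[[str], int] = whitespace_length,
) -> str:
    """Return the candidate with the highest reward (first one wins on ties)."""
    return max(
        candidates,
        key=lambda text: rule_based_reward(count_tokens(text), constraint),
    )


if __name__ == "__main__":
    # "Equal to 5" is expressed as min_len == max_len == 5.
    equal_to_five = LengthConstraint(min_len=5, max_len=5)
    samples = ["a b c", "a b c d e", "a b c d e f g"]
    print(sample_filter(samples, equal_to_five))
```

In this sketch an "equal to" instruction is encoded by setting both bounds to the same value, while "less than" or "more than" style instructions leave one bound as `None`. During RL training the same scalar could serve as the reward signal; here it is only used to pick among sampled candidates.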

Paper Structure

This paper contains 34 sections, 3 equations, 7 figures, 16 tables.

Figures (7)

  • Figure 1: Overview of the model architecture. In the training stage, the scores given by the reward model are used for the reinforcement learning method. In the inference stage, the scores are applied for ranking and selecting the output sequences generated by LLMs.
  • Figure 2: Demonstration of the Standard Prompt Extractor (SPE). Generative models are trained to output the standard control prompts (SCPs) directly (left), while discriminative models are trained to predict the type of each control instruction, as well as the requested length values from the user utterance, such as the minimum value and the maximum value (right).
  • Figure 3: Learning Curves of Standard Prompt Extractors. (a) Validation losses of GPT extractor. (b) Validation losses of BERT extractor. (c) Matching accuracy of GPT extractor. (d) Matching accuracy of BERT extractor. We show the curves of validation cross entropy and matching rate for both cases.
  • Figure 4: Learning curves with GPT-S for a single-type control instruction ("equal to" only) without sample filtering.
  • Figure 5: Learning curves with GPT-S for multi-type control instructions without sample filtering.
  • ...and 2 more figures