Prompt-Based Length Controlled Generation with Reinforcement Learning
Renlong Jie, Xiaojun Meng, Lifeng Shang, Xin Jiang, Qun Liu
TL;DR
The paper addresses the challenge of enforcing precise output length in autoregressive LLMs, which is important for real-world applications and can reduce inference cost. It introduces a prompt-based length-control framework that converts diverse user prompts into Standard Control Prompts (SCPs) via a Standard Prompt Extractor (SPE) and then trains LLMs with reinforcement learning using either a rule-based or a model-based reward, leveraging Proximal Policy Optimization (PPO). The method includes a novel sampling-filter mechanism that ranks generated candidates by a reward model during inference, and demonstrates substantial reductions in length-control error on CNNDM and NYT, with strong generalization to unseen prompt templates. The approach offers practical improvements for user-facing length control and can be extended to other controllable formats beyond length.
Abstract
Large language models (LLMs) like ChatGPT and GPT-4 have attracted great attention given their surprising performance on a wide range of NLP tasks. Length-controlled generation with LLMs has emerged as an important topic, as it enables users to fully leverage the capability of LLMs in more real-world scenarios, such as generating a proper answer or essay of a desired length. In addition, autoregressive generation in LLMs is extremely time-consuming, and the ability to control the generated length can reduce the inference cost by limiting the length. Therefore, we propose a prompt-based length control method to achieve high-accuracy length-controlled generation. In particular, we adopt reinforcement learning with the reward signal given by either trainable or rule-based reward models, which further enhances the length-control ability of LLMs by rewarding outputs that follow pre-defined control instructions. To enable rule-based inference, we also introduce a standard prompt extractor to collect the standard control information from users' input. Experiments show that our method significantly improves the accuracy of prompt-based length control for the summarization task on popular datasets like CNNDM and NYT. Both the standard prompt extractor and the RL-tuned model have shown strong generalization ability to unseen control prompt templates.
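
To make the rule-based path more concrete, the following is a minimal Python sketch of how a standard prompt extractor, a rule-based length reward, and candidate filtering could fit together. It is an illustration under stated assumptions, not the authors' implementation: the function names (`extract_control`, `length_reward`, `select_best`), the four control types, the regex-based extraction, the inclusive treatment of bounds, and the reward defined as the negative word-count deviation are all hypothetical choices. The extractor described above is presented as a component that generalizes to unseen prompt templates, which a fixed regex cannot do; the regex only stands in for it here.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical control representation: (kind, first bound, optional second bound).
Control = Tuple[str, int, Optional[int]]


def extract_control(prompt: str) -> Optional[Control]:
    """Toy regex stand-in for a standard prompt extractor (assumption)."""
    text = prompt.lower()
    m = re.search(r"between (\d+) and (\d+) words", text)
    if m:
        return ("between", int(m.group(1)), int(m.group(2)))
    patterns = (
        ("more", r"more than (\d+) words"),
        ("less", r"less than (\d+) words"),
        ("equal", r"exactly (\d+) words"),
    )
    for kind, pattern in patterns:
        m = re.search(pattern, text)
        if m:
            return (kind, int(m.group(1)), None)
    return None


def length_reward(control: Control, output: str) -> int:
    """Assumed rule-based reward: 0 when the constraint holds, otherwise the
    negative number of words by which the output misses it. Bounds are treated
    as inclusive, which is a simplification."""
    kind, lo, hi = control
    n = len(output.split())  # whitespace word count, chosen for simplicity
    if kind == "equal":
        return -abs(n - lo)
    if kind == "more":
        return -max(0, lo - n)
    if kind == "less":
        return -max(0, n - lo)
    if kind == "between" and hi is not None:
        return -(max(0, lo - n) + max(0, n - hi))
    raise ValueError(f"unknown control: {control!r}")


def select_best(prompt: str, candidates: List[str]) -> str:
    """Sample-filter step: keep the candidate with the highest reward."""
    control = extract_control(prompt)
    if control is None:
        raise ValueError("no length instruction found in prompt")
    return max(candidates, key=lambda c: length_reward(control, c))
```

In this sketch the same reward function could serve both as the training signal for reinforcement learning and as the ranking score when filtering sampled candidates at inference time; a trainable reward model would replace `length_reward` with a learned scorer.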
