Dr Genre: Reinforcement Learning from Decoupled LLM Feedback for Generic Text Rewriting

Yufei Li, John Nham, Ganesh Jawahar, Lei Shu, David Uthus, Yun-Hsuan Sung, Chengrun Yang, Itai Rolnick, Yi Qiao, Cong Liu

TL;DR

This work tackles generic text rewriting by introducing Dr Genré, a decoupled-reward reinforcement learning framework that covers three rewriting tasks (factuality, style, and conversation) and combines objective-specific rewards using task-specific weights. It builds ChatRewrite and pairs it with LongFact and RewriteLM to form a broad benchmark for training and evaluation, and it reports that dynamic, decoupled rewards yield higher-quality rewrites across these tasks, improving agreement, coherence, and conciseness. The approach uses supervised fine-tuning on mixed data, LLM-based reward modeling, and PPO-based RL with a KL constraint that keeps the policy close to the reference policy. AutoRater evaluations and case studies indicate that Dr Genré adapts its alignment direction to task requirements and outperforms single-reward baselines, which suggests it is suited to general-purpose text rewriting in real-world user scenarios.

Abstract

Generic text rewriting is a prevalent large language model (LLM) application that covers diverse real-world tasks, such as style transfer, fact correction, and email editing. These tasks vary in rewriting objectives (e.g., factual consistency vs. semantic preservation), making it challenging to develop a unified model that excels across all dimensions. Existing methods often specialize in either a single task or a specific objective, limiting their generalizability. In this work, we introduce a generic model proficient in factuality, stylistic, and conversational rewriting tasks. To simulate real-world user rewrite requests, we construct a conversational rewrite dataset, ChatRewrite, whose "natural"-sounding instructions are generated from raw emails using LLMs. Combined with other popular rewrite datasets, including LongFact for the factuality rewrite task and RewriteLM for the stylistic rewrite task, this forms a broad benchmark for training and evaluating generic rewrite models. To align with task-specific objectives, we propose Dr Genre, a Decoupled-reward learning framework for Generic rewriting that uses objective-oriented reward models with task-specific weighting. Evaluation shows that Dr Genre delivers higher-quality rewrites across all targeted tasks, improving objectives including instruction following (agreement), internal consistency (coherence), and minimal unnecessary edits (conciseness).
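
The idea of objective-oriented rewards with task-specific weighting can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `TASK_WEIGHTS` table and its numeric values, the function names, and the per-sample KL penalty form (the log-probability difference commonly used in PPO-based fine-tuning). None of it is taken from the paper's implementation.

```python
from typing import Dict

# Hypothetical task-specific weights over the three objectives named in the
# abstract. The numbers are illustrative only, not values from the paper.
TASK_WEIGHTS: Dict[str, Dict[str, float]] = {
    "factuality": {"agreement": 0.5, "coherence": 0.3, "conciseness": 0.2},
    "style": {"agreement": 0.4, "coherence": 0.2, "conciseness": 0.4},
    "conversation": {"agreement": 0.6, "coherence": 0.2, "conciseness": 0.2},
}


def combined_reward(
    task: str,
    scores: Dict[str, float],
    weights: Dict[str, Dict[str, float]] = TASK_WEIGHTS,
) -> float:
    """Weighted sum of decoupled per-objective scores for one rewrite."""
    task_weights = weights[task]
    missing = set(task_weights) - set(scores)
    if missing:
        raise KeyError(f"missing objective scores: {sorted(missing)}")
    return sum(task_weights[name] * scores[name] for name in task_weights)


def kl_penalized_reward(
    reward: float,
    logp_policy: float,
    logp_reference: float,
    beta: float = 0.1,
) -> float:
    """Subtract a per-sample KL estimate (assumed form) scaled by beta."""
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    scores = {"agreement": 0.9, "coherence": 0.7, "conciseness": 0.4}
    for task in TASK_WEIGHTS:
        print(task, round(combined_reward(task, scores), 3))
```

Keeping the per-objective scores separate until the final weighted sum is what lets the same reward models serve different tasks: only the weight row changes, not the scorers.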

Paper Structure

This paper contains 21 sections, 4 equations, 6 figures, 25 tables.

Figures (6)

  • Figure 1: An example illustrating the three objectives for generic text rewriting. Agreement: follow the rewrite instruction (e.g., false claim correction). Coherence: maintain global coherence across revisions (e.g., updating the age difference to maintain logical flow). Conciseness: avoid unnecessary edits (e.g., specifying Bob's cause of death, "lymphoma", is irrelevant).
  • Figure 2: Dr Genré: RL fine-tuning with weighted decoupled rewards. Dashed lines represent workflows of reward modeling (for agreement and coherence). IR, RR denote initial and revised responses.
  • Figure 4: Reward learning curves during RL fine-tuning under static and dynamic weighting.
  • Figure : (a) Agreement
  • Figure : (a) Agreement
  • ...and 1 more figure