ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning

Jianghao Chen, Wei Sun, Qixiang Yin, Zhixing Tan, Jiajun Zhang

TL;DR

ACE-RL tackles the challenge of optimizing long-form generation by replacing coarse reward signals with fine-grained, instruction-adaptive constraint verification. The approach automatically constructs constraint checklists across content completeness, structural logic, and stylistic formatting, and uses a verifier LLM to score constraint satisfaction, combined with a length reward, within a GRPO-based RL framework. Empirical results on WritingBench and Arena-Write show substantial gains over SFT and LLM-as-a-Judge RL, with a top configuration approaching or surpassing proprietary systems. The work demonstrates a scalable, verifiable reward paradigm that improves long-form writing quality without relying on extensive high-quality paired data.

Abstract

Long-form generation has become a critical and challenging application for Large Language Models (LLMs). Existing studies are limited by their reliance on scarce, high-quality long-form response data and by their focus on coarse-grained, general-purpose metrics (e.g., coherence and helpfulness), which overlook the nuanced, scenario-specific requirements of real-world tasks. To address these limitations, we propose a framework utilizing Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first decomposes each instruction into a set of fine-grained, adaptive constraint criteria spanning key dimensions of long-form generation tasks. We then design a reward mechanism that quantifies response quality by how well each response satisfies its corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we leverage reinforcement learning to optimize LLMs using these fine-grained signals. Experimental results show that ACE-RL significantly outperforms existing SFT and RL baselines by 18.63% and 7.61%, respectively, on WritingBench, and our top-performing model even surpasses proprietary systems like GPT-4o by 8.76%, providing a more effective training paradigm for long-form generation scenarios.
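
The reward described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Constraint` type, the `Verifier` callable (a stand-in for the verifier LLM), the shape of `length_reward`, and the weighted sum in `total_reward` are all assumptions made for illustration, since this summary does not give the exact formulas. Only the group-relative normalization in `group_advantages` follows the standard GRPO recipe.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, Sequence


@dataclass(frozen=True)
class Constraint:
    """One checklist item (hypothetical type, for illustration only)."""
    dimension: str  # e.g. "content", "structure", "style"
    text: str


# Stand-in for the verifier LLM: maps (response, constraint) to a score in [0, 1].
Verifier = Callable[[str, Constraint], float]


def constraint_reward(response: str, checklist: Sequence[Constraint],
                      verify: Verifier) -> float:
    """Mean constraint-satisfaction score, with each score clamped to [0, 1]."""
    if not checklist:
        return 0.0
    return mean(min(1.0, max(0.0, verify(response, c))) for c in checklist)


def length_reward(n_words: int, target: int, tolerance: float = 0.2) -> float:
    """Assumed shape: 1.0 inside a tolerance band, linear decay outside it."""
    lo, hi = target * (1 - tolerance), target * (1 + tolerance)
    if lo <= n_words <= hi:
        return 1.0
    if n_words < lo:
        return max(0.0, n_words / lo)
    return max(0.0, 1.0 - (n_words - hi) / hi)


def total_reward(response: str, checklist: Sequence[Constraint], verify: Verifier,
                 target_words: int, length_weight: float = 0.5) -> float:
    """Assumed combination: a weighted sum of the two reward terms."""
    r_constraint = constraint_reward(response, checklist, verify)
    r_length = length_reward(len(response.split()), target_words)
    return (1 - length_weight) * r_constraint + length_weight * r_length


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keyword_verifier(response: str, constraint: Constraint) -> float:
    """Toy verifier: checks whether the constraint text appears in the response."""
    return 1.0 if constraint.text.lower() in response.lower() else 0.0
```

In practice `keyword_verifier` would be replaced by a call to a verifier model. The point of the sketch is that responses sampled for the same instruction are scored against the same checklist, and those scores are then normalized within the group before the policy update.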

Paper Structure

This paper contains 41 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of reward mechanisms: conventional methods vs. our proposed method.
  • Figure 2: The overall framework of ACE-RL. First, we collect diverse instructions for long-form generation tasks and create an instruction-adaptive constraint checklist for each across three dimensions. Second, a reward model is deployed to verify whether the policy model's responses meet each constraint. This constraint-enhanced reward, along with a length reward, is then used for RL training.
  • Figure 3: Examples of constraint generation from real-world user instructions across three key dimensions.
  • Figure 4: Comparison of the average group standard deviation of reward values.
  • Figure 5: Human preference evaluation between our ACE-RL method and different baselines.