ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning
Jianghao Chen, Wei Sun, Qixiang Yin, Zhixing Tan, Jiajun Zhang
TL;DR
ACE-RL tackles the challenge of optimizing long-form generation by replacing coarse reward signals with fine-grained, instruction-adaptive constraint verification. The approach automatically constructs constraint checklists across content completeness, structural logic, and stylistic formatting, and uses a verifier LLM to score constraint satisfaction, combined with a length reward, within a GRPO-based RL framework. Empirical results on WritingBench and Arena-Write show substantial gains over SFT and LLM-as-a-Judge RL, with a top configuration approaching or surpassing proprietary systems. The work demonstrates a scalable, verifiable reward paradigm that improves long-form writing quality without relying on extensive high-quality paired data.
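The TL;DR names GRPO as the RL framework. As a rough illustration of how per-response rewards become a training signal in that family of methods, the sketch below normalizes rewards within a group of responses sampled for the same instruction, following the commonly described form of GRPO. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same instruction.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population standard deviation; implementations may differ.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical rewards for three sampled responses.
    print(group_relative_advantages([0.2, 0.5, 0.8]))
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so the signal depends only on relative quality within the group.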
Abstract
Long-form generation has become a critical and challenging application for Large Language Models (LLMs). Existing studies are limited by their reliance on scarce, high-quality long-form response data and their focus on coarse-grained, general-purpose metrics (e.g., coherence and helpfulness), overlooking the nuanced, scenario-specific requirements of real-world tasks. To address these limitations, we propose a framework utilizing Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first decomposes each instruction into a set of fine-grained, adaptive constraint criteria spanning key dimensions of long-form generation tasks. Subsequently, we design a reward mechanism that quantifies the quality of a response by how well it satisfies the corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we leverage reinforcement learning to optimize LLMs using these fine-grained signals. Experimental results show that ACE-RL significantly outperforms existing SFT and RL baselines by 18.63% and 7.61%, respectively, on WritingBench, and our top-performing model even surpasses proprietary systems like GPT-4o by 8.76%, providing a more effective training paradigm in long-form generation scenarios.
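The idea of turning quality evaluation into constraint verification can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the three-level verdict scale, the tolerance-band length reward, the weighted-sum combination, and every name below (`VERDICT_SCORES`, `constraint_reward`, `length_reward`, `total_reward`, `keyword_verifier`) are hypothetical. In particular, `keyword_verifier` is a toy stand-in for the verifier LLM.

```python
from typing import Callable, List

# Hypothetical verdict-to-score mapping; the actual scale is not given here.
VERDICT_SCORES = {"met": 1.0, "partial": 0.5, "unmet": 0.0}


def constraint_reward(response: str, constraints: List[str],
                      verify: Callable[[str, str], str]) -> float:
    """Mean satisfaction score over an instruction's constraint checklist."""
    if not constraints:
        return 0.0
    scores = [VERDICT_SCORES[verify(response, c)] for c in constraints]
    return sum(scores) / len(scores)


def length_reward(n_words: int, target: int, tolerance: float = 0.2) -> float:
    """1.0 within a tolerance band around the target length, then linear decay to 0.

    Assumed shape for illustration; requires target > 0 and 0 <= tolerance < 1.
    """
    if target <= 0 or not 0.0 <= tolerance < 1.0:
        raise ValueError("target must be positive and tolerance in [0, 1)")
    deviation = abs(n_words - target) / target
    if deviation <= tolerance:
        return 1.0
    return max(0.0, 1.0 - (deviation - tolerance) / (1.0 - tolerance))


def total_reward(response: str, constraints: List[str], target_words: int,
                 verify: Callable[[str, str], str],
                 length_weight: float = 0.25) -> float:
    """Weighted sum of constraint and length rewards (assumed combination rule)."""
    c = constraint_reward(response, constraints, verify)
    l = length_reward(len(response.split()), target_words)
    return (1.0 - length_weight) * c + length_weight * l


def keyword_verifier(response: str, constraint: str) -> str:
    """Toy stand-in for a verifier LLM: a constraint is 'met' if it appears verbatim."""
    return "met" if constraint.lower() in response.lower() else "unmet"


if __name__ == "__main__":
    checklist = ["introduction", "conclusion", "references"]
    print(total_reward("introduction methods conclusion", checklist, 3, keyword_verifier))
```

In a real setup the `verify` callable would wrap a call to the verifier model and parse its verdict; keeping it as a plain function argument makes the reward logic testable without a model.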
