Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards

Ruipeng Jia, Yunyi Yang, Yongbo Gai, Kai Luo, Shihao Huang, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang

TL;DR

This work extends reinforcement learning with verifiable rewards to non-verifiable tasks by introducing a Pairwise Generative Reward Model (GenRM) guided by self-principled critiques and a Bootstrapped Relative Policy Optimization (BRPO) algorithm. By converting subjective writing quality into verifiable pairwise rewards and enabling reference-free, bootstrap-based policy updates, the approach yields robust writing abilities and resistance to reward hacking, without supervised fine-tuning. Empirical results show competitive performance on in-house and open benchmarks, with notable gains over scalar-reward baselines and demonstrated test-time scaling benefits. The findings suggest a broader RLVR framework that unifies rule-based, reference-based, and reference-free reward modeling for a wide spectrum of language tasks.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has enabled large language models (LLMs) to achieve remarkable breakthroughs in reasoning tasks with objective ground-truth answers, such as mathematics and code generation. However, a significant gap remains for non-verifiable tasks, like creative writing and open-ended dialogue, where quality assessment is inherently subjective and lacks definitive references. Existing approaches for these domains often rely on scalar reward models trained with human preferences, which suffer from limited generalization and are prone to reward hacking, such as over-explanation and length bias. In this work, we propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. The pairwise writing GenRM leverages self-principled critique to transform subjective assessments into reliable, verifiable rewards, while BRPO enables dynamic, reference-free pairwise comparison by leveraging a bootstrapped response as temporary reference from within group rollouts during RL training. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning, as demonstrated by Writing-Zero, which shows consistent improvement and strong resistance to reward hacking compared to scalar reward baselines. Furthermore, our method achieves competitive results on both in-house and open-source writing benchmarks. Our findings suggest the potential to unify rule-based, reference-based, and reference-free reward modeling under the RLVR framework, thus paving the way for a comprehensive and scalable RL training paradigm applicable across all language tasks.

Paper Structure

This paper contains 40 sections, 12 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of Eval RM scores during Writing-Zero training. The blue line shows Qwen3-32B-Base-ScalarRM-GRPO, while the orange line shows Writing-Zero (Qwen3-32B-Base-GenRM-BRPO). The dashed red and green lines indicate the SFT (Writing-SFT) and SFT+RL (Writing-SFT-GenRM-BRPO) baselines, respectively.
  • Figure 2: Demonstration of GRPO and our BRPO. BRPO implements bootstrapping by randomly selecting a reference response from the current group of policy responses, and directly achieves zero expectation for the final advantages.
  • Figure 3: GenRM training dynamics, showing test accuracy and training data drop rate.
  • Figure 4: Convergence of the GenRM's preference ratio during RL training.
  • Figure 5: Majority voting accuracy (voting@$n$) across $n = 1, 2, 4, 8$ on internal datasets, as evaluated by the pairwise GenRM.
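
The bootstrapping idea in the Figure 2 caption can be illustrated with a small sketch. This is not the paper's BRPO formula, which is not reproduced on this page. It assumes a reward of +1, -1, or 0 for each response compared against a randomly chosen reference from the same group, and assumes the reference itself receives 0. The names `pairwise_rewards` and `toy_judge` are hypothetical, and `toy_judge` is only a stand-in for the pairwise GenRM.

```python
import random
from typing import Callable, List, Optional, Sequence


def toy_judge(candidate: str, reference: str) -> int:
    """Hypothetical stand-in for a pairwise GenRM (assumption, not the paper's model).

    Returns +1 if the candidate has more distinct words than the reference,
    -1 if it has fewer, and 0 on a tie. It is antisymmetric by construction.
    """
    a, b = len(set(candidate.split())), len(set(reference.split()))
    return (a > b) - (a < b)


def pairwise_rewards(
    responses: Sequence[str],
    judge: Callable[[str, str], int],
    ref_index: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Score each response in a group against one reference from the same group.

    If ref_index is None, the reference is drawn uniformly at random, which is
    the bootstrapping step. The reference gets reward 0 (an assumption here).
    """
    if not responses:
        raise ValueError("responses must be non-empty")
    if ref_index is None:
        rng = rng or random.Random()
        ref_index = rng.randrange(len(responses))
    reference = responses[ref_index]
    return [
        0 if i == ref_index else judge(resp, reference)
        for i, resp in enumerate(responses)
    ]


if __name__ == "__main__":
    group = ["a b c", "a a a", "a b", "a b c d"]
    # Enumerate every possible reference to inspect the expectation exactly.
    totals = [sum(pairwise_rewards(group, toy_judge, ref_index=k))
              for k in range(len(group))]
    print(totals, sum(totals))
```

With any antisymmetric judge, the group's reward totals cancel when summed over all possible reference choices, so the expected total under a uniformly random reference is zero. That is one plausible reading of the zero-expectation property mentioned in the caption; the paper's actual advantage computation may differ.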