
R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

Wanlong Liu, Bo Zhang, Chenliang Li, Shaopeng Lai, Yuning Wu, Xuanyu Lei, Ming Yan

Abstract

While deep reasoning with long chain-of-thought has dramatically improved large language models in verifiable domains like mathematics, its effectiveness for open-ended tasks such as writing remains unexplored. In this paper, we conduct a systematic investigation revealing that existing mainstream reasoning models achieve limited gains on open-ended writing tasks. Our further analysis shows that these models lack deep reflection and revision patterns in open-ended writing, resulting in substantially smaller improvements compared to mathematical reasoning tasks. To address this limitation, we introduce R2-Write: an automated framework that synthesizes high-quality thinking trajectories enriched with explicit reflection and revision patterns through iterative writer-judge interaction. To prevent redundant reflections, we design a process reward mechanism that supervises reflection quality during reinforcement learning, improving both performance and token efficiency. Extensive experiments across multiple creative writing and deep-research benchmarks demonstrate significant improvements, validating that explicitly incorporating reflection and revision patterns unlocks deep reasoning capabilities for open-ended writing tasks.
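The iterative writer-judge interaction described above can be pictured as a simple draft-reflect-revise loop. The sketch below is purely illustrative: `writer`, `judge`, and the trajectory format are hypothetical placeholders, not the paper's actual interface, and the stopping rule is an assumption.

```python
def synthesize_trajectory(prompt, writer, judge, max_rounds=3):
    """Hypothetical sketch of writer-judge trajectory synthesis.

    The writer drafts, the judge critiques (a reflection signal), and the
    writer revises based on the critique. Each step is recorded so the
    concatenated trace forms a thinking trajectory with explicit
    reflection and revision patterns.
    """
    trajectory = []
    draft = writer(prompt)
    trajectory.append(("draft", draft))
    for _ in range(max_rounds):
        critique = judge(prompt, draft)            # reflection signal
        trajectory.append(("reflection", critique))
        if critique["score"] >= critique["threshold"]:
            break                                  # stop once quality suffices
        draft = writer(prompt, feedback=critique["comments"])
        trajectory.append(("revision", draft))
    return trajectory
```

Under this assumed shape, a process reward could then score each recorded `"reflection"` step, penalizing critiques that trigger no substantive revision.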

Paper Structure

This paper contains 67 sections, 8 equations, 20 figures, 11 tables, and 1 algorithm.

Figures (20)

  • Figure 1: Thinking pattern analysis. The first row shows the pattern distributions of three reasoning models on WritingBench and MATH500. The second row reports, for each model and task, the proportion of patterns that are judged to be helpful for obtaining the correct (or high-scoring) answer. All pattern annotations are obtained using Claude-4.5-Sonnet (anthropic2025claude).
  • Figure 2: Overview of the R2-Write pipeline, which consists of three stages: query data selection, data creation, and reinforcement learning (RL).
  • Figure 3: Token length distribution of thinking trajectories across different methods on WritingBench.
  • Figure 4: Data distribution of our constructed training set, which includes both SFT and RL data. (a) shows the domain distribution for creative writing tasks, and (b) shows the category distribution for report-generation tasks.
  • Figure 5: Effectiveness of reflection pattern usage. We categorize cases where reflection patterns are triggered into three outcomes: Win (R2-Write outperforms the baseline), Tie (comparable performance), and Lose (the baseline outperforms R2-Write). DG denotes DeepResearch Gym and WB denotes WritingBench. The vast majority of reflection instances lead to performance improvements, demonstrating that the model effectively leverages these patterns rather than applying them indiscriminately.
  • ...and 15 more figures