LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information

Bowen Ping, Jiali Zeng, Fandong Meng, Shuo Wang, Jie Zhou, Shanghang Zhang

TL;DR

LongDPO addresses the persistent challenge of high-quality long-form generation by introducing stepwise, process-supervised learning. It uses Monte Carlo Tree Search to collect stepwise preferences, a global memory pool to uphold factual consistency, and critique-augmented candidate refinement, followed by stepwise DPO training. Empirical results on LongBench-Write-en and LongGenBench across Llama and Qwen backbones show improved length adherence and writing quality with near-lossless general-task performance, and ablations confirm the effectiveness of the memory pool and external critiques. The work points to a practical pathway for applying fine-grained process supervision to complex, extended text tasks, going beyond traditional outcome-focused feedback.

Abstract

Long-form generation is crucial for tasks such as writing academic papers and repo-level code generation. Despite this, current models, including GPT-4o, still exhibit unsatisfactory performance. Existing methods that use preference learning with outcome supervision often fail to provide detailed feedback for extended contexts. This shortcoming can lead to content that does not fully satisfy query requirements, resulting in issues such as length deviations and diminished quality. In this paper, we propose enhancing long-form generation by incorporating process supervision. We employ Monte Carlo Tree Search to gather stepwise preference pairs, using a global memory pool to maintain consistency. To address the issue of suboptimal candidate selection, we integrate external critiques to refine and improve the quality of the preference pairs. Finally, we apply step-level DPO using the collected stepwise preference pairs. Experimental results show that our method improves length and quality on long-form generation benchmarks, with almost lossless performance on general benchmarks across various model backbones.
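The abstract does not spell out the step-level objective. As a minimal sketch, assuming the standard DPO loss is applied to a single step's chosen and rejected continuations (each conditioned on the prompt and the preceding steps), the per-pair loss could be computed as below. The function name, the argument names, and the default `beta=0.1` are illustrative assumptions, not values taken from the paper.

```python
import math


def step_dpo_loss(policy_chosen_logp: float,
                  policy_rejected_logp: float,
                  ref_chosen_logp: float,
                  ref_rejected_logp: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) step pair.

    Each argument is the summed token log-probability of one step,
    under the policy being trained or under the frozen reference model.
    beta is an assumed temperature, not a value from the paper.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # step over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree on both steps, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen step's likelihood relative to the rejected one.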

Paper Structure

This paper contains 30 sections, 8 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Top: outcome supervision, which directly provides feedback on the full extended sequence in long-form generation tasks. Bottom: LongDPO, which uses process supervision with a global memory to maintain factual consistency and external critiques to refine low-reward chosen candidates.
  • Figure 2: The pipeline of LongDPO. LongDPO incorporates process supervision and MCTS to collect stepwise preference data. During the selection phase, LongDPO uses the global memory pool to filter out candidates that may result in inconsistency, then selects the highest-scoring one as the chosen candidate, with another randomly selected as the rejected candidate. During tree expansion, LongDPO leverages external critiques only for low-reward chosen candidates. Then the collected preference pairs are used for step-level DPO training.
  • Figure 3: Main body of the generated critiques, which are detailed in Appendix \ref{refine_template}.
  • Figure 4: A case is randomly sampled from LongGenBench. The instruction primarily requires visiting the farmers' market starting from week 10 and then every 5 weeks thereafter. On the left, LongWriter-Llama fulfills the requirement in week 10 but fails in week 15. On the right, after applying LongDPO, LongWriter-Llama is able to consistently meet the demands.
  • Figure 5: Reward analysis of the selected candidates, focusing solely on the chosen candidate in each preference pair. On the x-axis, '0-3.0' represents the proportion of candidates with an average reward $< 3.0$, while '3.0-3.5' represents the proportion of candidates with an average reward $\geq 3.0$ but $< 3.5$. The detailed reward distribution can be found in Appendix \ref{reward_distribution_full}.
  • ...and 3 more figures
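
The selection and refinement steps described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `Candidate`, `select_pair`, `maybe_refine`, the `is_consistent` memory-pool check, the `refine` critique callback, and the reward `threshold` are all hypothetical names. Drawing the rejected candidate from the candidates that survive the consistency filter is also an assumption.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    text: str
    reward: float  # average reward assigned to this step candidate


def select_pair(candidates: List[Candidate],
                is_consistent: Callable[[Candidate], bool],
                rng: random.Random) -> Optional[Tuple[Candidate, Candidate]]:
    """Pick (chosen, rejected) for one step.

    is_consistent stands in for the global-memory-pool check (hypothetical).
    """
    # Drop candidates that the memory pool flags as potentially inconsistent.
    kept = [c for c in candidates if is_consistent(c)]
    if len(kept) < 2:
        return None  # not enough candidates to form a preference pair
    # The highest-scoring survivor is the chosen candidate.
    chosen = max(kept, key=lambda c: c.reward)
    # Another survivor, picked at random, is the rejected candidate.
    rejected = rng.choice([c for c in kept if c is not chosen])
    return chosen, rejected


def maybe_refine(chosen: Candidate,
                 refine: Callable[[Candidate], Candidate],
                 threshold: float) -> Candidate:
    """Apply critique-based refinement only to low-reward chosen candidates."""
    if chosen.reward < threshold:
        return refine(chosen)
    return chosen
```

A caller would run `select_pair` at each step of the search and pass the chosen candidate through `maybe_refine` before storing the pair for step-level DPO training.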