Document Reconstruction Unlocks Scalable Long-Context RLVR
Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin, Jung-jae Kim, Xiaoli Li, Roy Ka-wei Lee, Lidong Bing
TL;DR
This paper introduces an unsupervised RLVR framework that uses document reconstruction to train LLMs for long-context reasoning without gold-standard supervision. A few paragraphs of a long document are masked, and the model must pick the missing paragraphs from a set of candidate options and place them in the correct order; this yields verifiable rewards, and GRPO is used for stable, value-free policy optimization. The approach delivers noticeable gains on RULER and a reasonable improvement on LongBench v2, and ablations examine its sensitivity to reward design and data scaling, suggesting a scalable alternative to supervised long-context training. The work highlights document structure itself as a supervisory signal and lays groundwork for self-supervised objectives that enhance long-range understanding in LLMs.
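To make "value-free" concrete: GRPO replaces a learned value function with a baseline computed from a group of responses sampled for the same prompt. The following is a minimal sketch of that group-relative advantage, not the authors' implementation. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation. No value network is needed.
    `eps` (an assumed constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one reconstruction prompt,
# scored with a binary verifiable reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive a positive advantage and the rest a negative one; a group with identical rewards contributes no learning signal.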
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm for enhancing the capabilities (e.g., long-context understanding) of Large Language Models (LLMs). However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming to obtain. In this work, we investigate unsupervised approaches to enhancing the long-context capabilities of LLMs, eliminating the need for heavy human annotation or teacher-model supervision. Specifically, we first replace a few paragraphs in a long document with special placeholders. LLMs are then trained through reinforcement learning to reconstruct the document by correctly identifying and sequencing the missing paragraphs from a set of candidate options. This training paradigm enables the model to capture global narrative coherence, significantly boosting long-context performance. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench v2. The method yields noticeable gains on RULER and also achieves a reasonable improvement on LongBench v2 without any manually curated long-context QA data. Furthermore, we conduct extensive ablation studies to analyze the impact of reward design, data curation strategies, training schemes, and data scaling on model performance. We publicly release our code, data, and models.
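The reconstruction task described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the released code: the placeholder format, the letter labels, the helper names `build_task` and `exact_match_reward`, and the all-or-nothing reward are hypothetical choices (the paper ablates reward design, so its actual reward may differ). The candidate set here contains only the removed paragraphs, shuffled.

```python
import random

# Assumed placeholder format; the paper only says "special placeholders".
PLACEHOLDER = "<MASK_{}>"


def build_task(paragraphs, num_masked, seed=0):
    """Mask `num_masked` paragraphs and return a reconstruction task."""
    rng = random.Random(seed)
    masked_idx = sorted(rng.sample(range(len(paragraphs)), num_masked))

    # Replace each chosen paragraph with a numbered placeholder.
    doc = list(paragraphs)
    removed = []
    for slot, idx in enumerate(masked_idx):
        removed.append(paragraphs[idx])
        doc[idx] = PLACEHOLDER.format(slot)

    # Shuffle the removed paragraphs so position gives no hint.
    order = list(range(num_masked))
    rng.shuffle(order)
    labels = [chr(ord("A") + k) for k in range(num_masked)]
    candidates = {labels[j]: removed[order[j]] for j in range(num_masked)}

    # Gold answer: the candidate label for each placeholder, in document order.
    answer = [labels[order.index(slot)] for slot in range(num_masked)]
    return {
        "document": "\n\n".join(doc),
        "candidates": candidates,
        "answer": answer,
    }


def exact_match_reward(predicted, answer):
    """Verifiable reward: 1.0 only if the whole sequence is correct (assumed)."""
    return 1.0 if list(predicted) == list(answer) else 0.0


paragraphs = ["Intro.", "Setup.", "Method.", "Results.", "Conclusion."]
task = build_task(paragraphs, num_masked=2, seed=1)
reward = exact_match_reward(task["answer"], task["answer"])  # 1.0
```

Because the gold ordering comes from the document itself, the reward can be checked mechanically and no annotator or teacher model is involved.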
