Document Reconstruction Unlocks Scalable Long-Context RLVR

Yao Xiao; Lei Wang; Yue Deng; Guanzheng Chen; Ziqi Jin; Jung-jae Kim; Xiaoli Li; Roy Ka-wei Lee; Lidong Bing

TL;DR

This paper introduces an unsupervised RLVR framework that uses document reconstruction to train LLMs for long-context reasoning without gold-standard supervision. Paragraphs in a long document are masked, and the model must restore them in order from a set of candidate options, which yields verifiable rewards; GRPO is used for stable, value-free policy optimization. On RULER and LongBench v2, the approach delivers substantial gains on long-context tasks and demonstrates robustness to reward design and data scaling, suggesting a scalable alternative to supervised long-context training. The work highlights document structure itself as a supervisory signal and lays groundwork for self-supervised objectives that enhance long-range understanding in LLMs.
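As a rough illustration of what "value-free" means here, GRPO replaces a learned value baseline with statistics computed over a group of sampled responses to the same prompt. The sketch below shows only that group-relative normalization step; the function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation. `eps` (an assumed
    constant) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled reconstructions, two of them fully correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.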

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm for enhancing the capabilities (e.g., long-context capability) of Large Language Models (LLMs). However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming to obtain. In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotation or supervision from teacher models. Specifically, we first replace a few paragraphs in a long document with special placeholders. LLMs are then trained through reinforcement learning to reconstruct the document by correctly identifying and sequencing the missing paragraphs from a set of candidate options. This training paradigm enables the model to capture global narrative coherence, significantly boosting long-context performance. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench v2. The method yields noticeable gains on RULER and a reasonable improvement on LongBench v2 without any manually curated long-context QA data. Furthermore, we conduct extensive ablation studies to analyze the impact of reward design, data curation strategies, training schemes, and data scaling on model performance. We publicly release our code, data, and models.
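The task described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' released implementation: the placeholder format, the letter labels for options, the helper names, and the two reward variants (an all-or-nothing "sparse" score and a per-position "dense" score) are all hypothetical choices made for the example.

```python
import random

PLACEHOLDER = "<MISSING_{}>"  # hypothetical placeholder format


def build_task(paragraphs, num_masked, seed=0):
    """Mask `num_masked` paragraphs and return (corrupted, options, answer).

    `answer` lists, for each placeholder in document order, the label
    of the option that belongs there.
    """
    rng = random.Random(seed)
    masked_idx = sorted(rng.sample(range(len(paragraphs)), num_masked))
    removed = [paragraphs[i] for i in masked_idx]
    corrupted = list(paragraphs)
    for slot, i in enumerate(masked_idx):
        corrupted[i] = PLACEHOLDER.format(slot + 1)
    order = list(range(num_masked))
    rng.shuffle(order)  # shuffle the removed paragraphs into options
    labels = [chr(ord("A") + k) for k in range(num_masked)]
    options = {labels[k]: removed[order[k]] for k in range(num_masked)}
    answer = [labels[order.index(slot)] for slot in range(num_masked)]
    return corrupted, options, answer


def reconstruct(corrupted, options, predicted):
    """Fill placeholders, in document order, with the predicted options."""
    picks = iter(predicted)
    return [options[next(picks)] if p.startswith("<MISSING_") else p
            for p in corrupted]


def sparse_reward(predicted, answer):
    """1.0 only if the whole option sequence is correct."""
    return 1.0 if predicted == answer else 0.0


def dense_reward(predicted, answer):
    """Fraction of placeholders filled with the correct option."""
    if len(predicted) != len(answer):
        return 0.0
    return sum(p == a for p, a in zip(predicted, answer)) / len(answer)
```

Because the answer is derived from the original document, the reward can be checked mechanically, with no human labels or teacher model involved.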

Paper Structure (30 sections, 5 equations, 9 figures, 3 tables)

Introduction
Background
Group Relative Policy Optimization (GRPO).
Method
Task Formulation: Document Reconstruction
Reward Design
Curriculum through Complexity Scaling
Experimental Setup
Data Curation
Training.
Evaluation.
Models and Baselines.
Results and Analysis
Main Results
Dense vs. Sparse Reward
...and 15 more sections

Figures (9)

Figure 1: We report the average score of RULER and overall score of LongBench v2 for Qwen2.5-7B-Instruct-1M and LLaMA-3.1-8B-Instruct.
Figure 2: Overview of the document reconstruction framework. Given a long document, we corrupt it by selecting some paragraphs and shuffling them as options. We train LLMs via RLVR to reconstruct the document by generating the option sequence in order.
Figure 3: Performance comparison across different context lengths and models.
Figure 4: Average scores of RULER. We compare performance of dense and sparse rewards.
Figure 5: Average scores of RULER. We compare the performance of different option length mixture ratios.
...and 4 more figures
