Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards
Zhen Wang, Zhifeng Gao, Guolin Ke
TL;DR
MR-RLVR addresses the challenge of guiding multi-step mathematical reasoning when only the final answer can be verified. It introduces two process-level self-supervised tasks, Masked-Then-Fill and Step Reordering, that derive dense, automatic training signals from reasoning trajectories. Training proceeds in two stages: the policy is first shaped with dense process-level rewards, then fine-tuned under outcome-only verifiable rewards, which improves stability and scalability. Across models and benchmarks (e.g., AIME, AMC, MATH500), MR-RLVR achieves consistent gains over a GRPO baseline, with notable data-efficiency advantages in low-data regimes. The approach offers a practical path to more reliable reasoning in LLMs and can extend to broader structured reasoning and multimodal tasks.
Abstract
Test-time scaling has been shown to substantially improve the mathematical reasoning of large language models (LLMs). However, for a large portion of mathematical corpora, especially theorem proving, the scalability of reinforcement learning from verifiable rewards (RLVR) is limited: intermediate reasoning is crucial, while final answers are difficult to verify directly and reliably. Meanwhile, token-level SFT often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT's self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering" to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets where only outcomes are verifiable. We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500. Under a fixed sampling and decoding budget, MR-RLVR achieves average relative gains over the original RLVR baseline of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8. These results indicate that incorporating process-aware self-supervised signals can effectively enhance RLVR's scalability and performance in settings where only outcomes are verifiable.
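To make the two self-supervised tasks concrete, the following is a minimal sketch of how "masked-then-fill" and "step reordering" examples might be built from a reasoning trajectory that has already been split into steps. The abstract does not specify the construction or the reward functions, so everything here is an illustrative assumption: the `MASK_TOKEN` placeholder, the function names, the choice to mask a single step, the exact-match fill reward, and the position-wise reordering reward are all hypothetical.

```python
import random
from typing import List, Tuple

MASK_TOKEN = "<MASK>"  # hypothetical placeholder, not taken from the paper


def make_masked_fill_example(
    steps: List[str], rng: random.Random
) -> Tuple[List[str], int, str]:
    """Hide one randomly chosen step; the hidden text is the target to recover."""
    idx = rng.randrange(len(steps))
    masked = list(steps)
    target = masked[idx]
    masked[idx] = MASK_TOKEN
    return masked, idx, target


def make_reorder_example(
    steps: List[str], rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Shuffle the steps; order[i] is the original index of shuffled[i]."""
    order = list(range(len(steps)))
    rng.shuffle(order)
    shuffled = [steps[i] for i in order]
    return shuffled, order


def fill_reward(prediction: str, target: str) -> float:
    """Toy reward (assumption): 1.0 on exact match after trimming whitespace."""
    return 1.0 if prediction.strip() == target.strip() else 0.0


def reorder_reward(predicted_order: List[int], true_order: List[int]) -> float:
    """Toy reward (assumption): fraction of positions assigned the right index."""
    if not true_order or len(predicted_order) != len(true_order):
        return 0.0
    matches = sum(p == t for p, t in zip(predicted_order, true_order))
    return matches / len(true_order)


if __name__ == "__main__":
    rng = random.Random(0)
    trajectory = [
        "Let x be the unknown.",
        "Then 2x + 3 = 11.",
        "So 2x = 8.",
        "Hence x = 4.",
    ]
    masked, idx, target = make_masked_fill_example(trajectory, rng)
    print(masked, "-> recover:", target)
    shuffled, order = make_reorder_example(trajectory, rng)
    print(shuffled, "-> original indices:", order)
```

Both rewards are computed from the trajectory itself, without a final-answer verifier, which is what allows this kind of signal to be extracted from proof-style data. A real system would likely need a more tolerant fill reward than exact string match, since mathematically equivalent steps can be written in many ways.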
