Learn Hard Problems During RL with Reference Guided Fine-tuning

Yangzhen Wu; Shanda Li; Zixin Wen; Xin Zhou; Ameet Talwalkar; Yiming Yang; Wenhao Huang; Tianle Cai


TL;DR

We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that uses human-written reference solutions to synthesize positive trajectories on hard problems and trains on them before RL. This overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.

Abstract

Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: on challenging problems, the LLM fails to sample any correct trajectory, so RL receives no meaningful positive feedback. At the same time, human-written reference solutions often accompany the problems (e.g., problems from AoPS), but directly fine-tuning on these solutions offers no benefit because models often cannot imitate human proofs that lie outside their own reasoning distribution. We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that uses human-written reference solutions to synthesize positive trajectories on hard problems and trains on them before RL. For each problem, we provide the model with a partial reference solution and let it generate its own reasoning trace, so that the resulting trajectories remain in the model's reasoning space while still benefiting from reference guidance. Fine-tuning on these reference-guided trajectories increases the number of solvable problems and produces a checkpoint that receives more positive rewards during RL. Across three benchmarks (AIME24, AIME25, BeyondAIME), ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
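The data-construction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Problem` class, the `sample` callable, the 50% reference fraction, the prompt wording, and exact-match answer checking are all assumptions made for illustration. Pairing each kept trace with the original question (without the hint) is likewise an assumption; the abstract does not specify the training format.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Problem:
    question: str
    answer: str      # ground-truth final answer, used to verify traces
    reference: str   # human-written reference solution


# Hypothetical sampler interface: (prompt, n) -> n pairs of (trace, final_answer).
Sampler = Callable[[str, int], List[Tuple[str, str]]]


def partial_reference(reference: str, fraction: float) -> str:
    """Return the leading `fraction` of the reference, cut at a line boundary."""
    lines = reference.splitlines()
    keep = max(1, int(len(lines) * fraction))
    return "\n".join(lines[:keep])


def collect_trajectories(
    problems: List[Problem],
    sample: Sampler,
    n: int = 4,
    fraction: float = 0.5,  # assumed value; the source does not give one
) -> List[Tuple[str, str]]:
    """Build (question, trace) fine-tuning pairs from verified-correct traces."""
    data: List[Tuple[str, str]] = []
    for p in problems:
        # Standard sampling: keep traces whose final answer is correct.
        correct = [t for t, a in sample(p.question, n) if a == p.answer]
        if not correct:
            # No correct trace: retry with a partial reference in the prompt,
            # so the model still writes the reasoning in its own words.
            hint = partial_reference(p.reference, fraction)
            prompt = f"{p.question}\n\nPartial reference solution:\n{hint}"
            correct = [t for t, a in sample(prompt, n) if a == p.answer]
        # Assumption: train on the original question, without the hint.
        data.extend((p.question, t) for t in correct)
    return data
```

In this sketch, problems the model already solves contribute their self-generated traces, and the reference-guided branch runs only for problems where standard sampling yields no correct trace, which mirrors the distinction between ReFT and ReGFT drawn in Figure 1.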


Paper Structure (25 sections, 9 equations, 5 figures, 2 tables)


Introduction
Related Work
Scaling RL and Adaptive Sampling.
Question Augmentation during RL.
Interleaving SFT and RL.
Our Difference.
Approach
On-policy finetuning with model-generated trajectories (ReFT).
Reference-Guided Finetuning for Unsolved Problems (ReGFT).
Reinforcement learning from improved initialization
Experiments
Experimental Setup
Main Results
Reference-Guided Finetuning Enhances RL Training
Impact of Reference-Guided Demonstrations
...and 10 more sections

Figures (5)

Figure 1: Comparison of ReFT and ReGFT. Top: ReFT fine-tunes the model using verified correct trajectories obtained from standard sampling. Bottom: ReGFT additionally applies reference-guided sampling to recover hard problems without correct trajectories.
Figure 2: Reinforcement learning performance over training steps on three challenging benchmarks. Models initialized with ReGFT consistently achieve higher accuracy, faster convergence, and superior final performance compared to the raw checkpoint, demonstrating that reference-guided fine-tuning provides a stronger initialization for RL.
Figure 3: Reinforcement learning performance comparison between ReFT and ReGFT across three benchmarks. While both initializations accelerate early-stage RL, ReGFT consistently achieves higher final accuracy, highlighting the contribution of reference-guided demonstrations beyond self-generated trajectories.
Figure 4: Accuracy comparison between direct SFT on raw human reference solutions and ReGFT. Models trained directly on reference solutions fail to achieve competitive RL performance, underscoring the importance of model-derived reasoning trajectories.
Figure 5: Inference-time scaling performance (pass@k) of raw and RL-trained checkpoints across benchmarks. Solid lines denote DAPO-trained models, while dashed lines indicate the raw pre-RL checkpoint.
