Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions

Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, Wentao Zhang

TL;DR

This paper identifies a fundamental limit of purely RL-based fine-tuning in LLM reasoning: it struggles to acquire information beyond a model's initial capabilities. It proposes ReLIFT, a framework that interleaves reinforcement learning with online fine-tuning on the hardest questions, using a dynamic buffer of high-quality demonstrations. Empirically, ReLIFT achieves state-of-the-art results on multiple mathematical reasoning benchmarks and outperforms pure RL, pure SFT, and existing hybrids while using far less demonstration data. The approach demonstrates robustness across base models and indicates that selective online fine-tuning on challenging cases can meaningfully extend LLM reasoning beyond current cognitive constraints.

Abstract

Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between RL and fine-tuning to enhance the model's reasoning abilities. ReLIFT achieves an average improvement of over 5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscore its significant potential.

Paper Structure

This paper contains 20 sections, 7 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Accuracy and response length changes for Easy, Medium, Hard, and Hardest questions during RL and SFT training. (a) Average accuracy change for each difficulty category. (b) Average response length change for each difficulty category. (c) Number of questions transitioning between different initial and final accuracy categories in RL and SFT, respectively. The x-axis represents the initial difficulty category, and the y-axis represents the final difficulty category.
  • Figure 2: Overview of the ReLIFT training framework. The model is mainly trained with RL. When it encounters particularly hard questions, high-quality solutions are collected or generated, then stored in a buffer. Once enough hard examples are gathered, a fine-tuning (FT) step is performed using these examples. This process adaptively alternates between RL and FT to help the model learn from its mistakes and improve its reasoning ability. Here, $N$ denotes the number of hardest $(q, s)$ pairs in the buffer, and $M$ is a predefined threshold, typically set to the FT batch size.
  • Figure 3: Training dynamics of rewards, response lengths, and training entropy during RL and ReLIFT training.
  • Figure 4: Ablation study on ReLIFT. The left and right bars represent average accuracy and length, respectively.
  • Figure 5: Accuracy versus the choice of $\alpha$.
  • ...and 3 more figures
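
The alternation described in the Figure 2 caption can be sketched as a small scheduler. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `InterleavedTrainer`, its callbacks `rl_step`, `ft_step`, and `get_solution`, and the rule that a question counts as "hardest" when its rollout accuracy is exactly zero are all hypothetical. Only the buffer of $(q, s)$ pairs, its size $N$, and the threshold $M$ come from the caption. Whether the buffer is drained after each FT step is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class InterleavedTrainer:
    """Toy scheduler that alternates RL steps with buffered fine-tuning (FT) steps."""

    ft_threshold: int  # M in the Figure 2 caption (typically the FT batch size)
    rl_step: Callable[[Sequence[str]], Dict[str, float]]  # hypothetical: returns per-question accuracy
    ft_step: Callable[[List[Tuple[str, str]]], None]      # hypothetical: fine-tunes on (q, s) pairs
    get_solution: Callable[[str], str]                    # hypothetical: fetches a high-quality solution
    buffer: List[Tuple[str, str]] = field(default_factory=list)
    schedule: List[str] = field(default_factory=list)

    def train_on_batch(self, questions: Sequence[str]) -> None:
        # The model is mainly trained with RL: every batch gets one RL step.
        accuracy = self.rl_step(questions)
        self.schedule.append("RL")

        # Assumption: "hardest" means no rollout for the question was correct.
        for q in questions:
            if accuracy[q] == 0.0:
                self.buffer.append((q, self.get_solution(q)))

        # Once N >= M, run an FT step on M buffered pairs and drop them.
        while len(self.buffer) >= self.ft_threshold:
            ft_batch = self.buffer[: self.ft_threshold]
            del self.buffer[: self.ft_threshold]
            self.ft_step(ft_batch)
            self.schedule.append("FT")
```

In this sketch FT steps are triggered by the data rather than by a fixed schedule, so a model that rarely fails completely would spend almost all of its steps on RL.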