Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners

Xiangchi Yuan, Xiang Chen, Tong Yu, Dachuan Shi, Can Jin, Wenke Lee, Saayan Mitra

TL;DR

The paper tackles forgetting and data inefficiency when combining supervised fine-tuning (SFT) with reinforcement learning (RL) for reasoning tasks in large language models. It introduces MIFO, a plug-and-play framework with two pillars: (1) data processing that interleaves RL with SFT by curating challenging examples and restricting the SFT loss to high-entropy tokens, and (2) parameter freezing that protects RL-critical parameters during SFT to reduce forgetting. Empirically, MIFO achieves state-of-the-art reasoning performance on math benchmarks for both 1.5B and 7B models while using only $1.5\%$ of the SFT data and $20.4\%$ of the RL data used by the prior SoTA, and it generates more concise reasoning traces. The approach is algorithm-agnostic and is both data-efficient and robust to template shifts, which makes it practical for scalable, reliable reasoning post-training. Theoretical insights also suggest why SFT updates tend to be more redundant than RL updates, which informs the forgetting mitigation strategy.

Abstract

Large Language Models (LLMs) show strong reasoning abilities, often amplified by Chain-of-Thought (CoT) prompting and reinforcement learning (RL). Although RL algorithms can substantially improve reasoning, they struggle to expand reasoning boundaries because they learn from their own reasoning trajectories rather than acquiring external knowledge. Supervised fine-tuning (SFT) offers complementary benefits but typically requires large-scale data and risks overfitting. Recent attempts to combine SFT and RL face three main challenges: data inefficiency, algorithm-specific designs, and catastrophic forgetting. We propose a plug-and-play framework that dynamically integrates SFT into RL by selecting challenging examples for SFT. This approach reduces SFT data requirements and remains agnostic to the choice of RL or SFT algorithm. To mitigate catastrophic forgetting of RL-acquired skills during SFT, we select high-entropy tokens for loss calculation and freeze parameters identified as critical for RL. Our method achieves state-of-the-art (SoTA) reasoning performance using only 1.5% of the SFT data and 20.4% of the RL data used by prior SoTA, providing an efficient and plug-and-play solution for combining SFT and RL in reasoning post-training.
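The high-entropy token selection described above can be illustrated with a minimal sketch. It assumes per-token logits are available as plain lists and that tokens are kept by a top-fraction rule; the function names and the `keep_ratio` hyperparameter are hypothetical and are not taken from the paper, which may use a different selection criterion.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_sft_loss(logits_per_token, targets, keep_ratio=0.2):
    """Mean negative log-likelihood over the highest-entropy tokens only.

    keep_ratio is an assumed hyperparameter: the fraction of token
    positions (ranked by predictive entropy) that contribute to the loss.
    Returns the loss and the sorted indices of the kept positions.
    """
    probs = [softmax(z) for z in logits_per_token]
    entropies = [token_entropy(p) for p in probs]
    k = max(1, math.ceil(keep_ratio * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = ranked[:k]
    nll = [-math.log(probs[i][targets[i]]) for i in keep]
    return sum(nll) / len(nll), sorted(keep)


# Four token positions over a 3-word vocabulary; positions 1 and 3 are the
# most uncertain, so only they contribute to the loss when keep_ratio=0.5.
logits = [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
targets = [0, 1, 0, 2]
loss, kept = high_entropy_sft_loss(logits, targets, keep_ratio=0.5)
```

In this sketch, confident positions (low entropy) receive no SFT gradient at all, which is one plausible way to limit how much SFT overwrites behaviour the model already has.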

Paper Structure

This paper contains 53 sections, 2 theorems, 29 equations, 13 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

Fix a context $x$ and target token $y\in\{1,\dots,V\}$ with vocabulary size $V$. Let $\theta\in\mathbb{R}^d$ be model parameters and $\theta_0$ a reference point (e.g., pre-update). For logits $z_\theta(x)\in\mathbb{R}^V$ and probabilities $p_\theta=\mathrm{softmax}(z_\theta)$, define the active com… and for any training trajectory with net update $\Delta\theta=\theta_T-\theta_0$, we have decision–…

Figures (13)

  • Figure 1: Dropping gradients during parameter updates causes a larger performance drop for RL.
  • Figure 2: Compared with SFT, RL shows a notable drop as the pruning rate $p_{post}$ increases.
  • Figure 3: SFT induces much larger parameter updates, in magnitude, than RL.
  • Figure 4: MIFO is a training pipeline with multiple connected RL$\to$SFT intervals. During RL training, it selects data for SFT and, at the end, identifies the RL-important parameters. During SFT training, the RL-important parameters are frozen and only high-entropy tokens are used for loss calculation.
  • Figure 5: Average reasoning score vs. response length with no template (left) and with the Qwen template (middle) for the 7B model; SFT and RL data usage for training the 7B model (right).
  • ...and 8 more figures
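
The freezing step in the Figure 4 caption can be sketched as follows. The source does not state here how RL-important parameters are identified, so this sketch assumes a simple criterion: the parameters with the largest absolute change during the preceding RL interval are frozen. The function names, the `freeze_ratio` hyperparameter, and the plain gradient-descent update are all hypothetical stand-ins.

```python
import math


def rl_critical_mask(theta_before, theta_after, freeze_ratio=0.25):
    """Flag the parameters that moved most during the RL interval.

    Assumption (not from the paper): importance is measured by the
    absolute RL update |theta_after - theta_before|, and the top
    freeze_ratio fraction of parameters is frozen.
    """
    deltas = [abs(a - b) for a, b in zip(theta_after, theta_before)]
    k = math.ceil(freeze_ratio * len(deltas))
    ranked = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    frozen = set(ranked[:k])
    return [i in frozen for i in range(len(deltas))]


def masked_sft_step(theta, grads, frozen, lr=0.1):
    """One gradient-descent step that leaves frozen parameters untouched."""
    return [t if f else t - lr * g for t, g, f in zip(theta, grads, frozen)]


# Toy example: parameters 0 and 2 changed most during RL, so they are frozen
# and the subsequent SFT step only updates parameters 1 and 3.
theta_before_rl = [0.0, 0.0, 0.0, 0.0]
theta_after_rl = [0.5, 0.01, -0.3, 0.02]
mask = rl_critical_mask(theta_before_rl, theta_after_rl, freeze_ratio=0.5)
theta_after_sft = masked_sft_step(theta_after_rl, [1.0, 1.0, 1.0, 1.0], mask, lr=0.1)
```

In a real training loop the same idea would typically be applied per tensor or per element by zeroing gradients under the mask before the optimizer step; the list-based version here only shows the logic.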

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Theorem 1
  • proof