Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners
Xiangchi Yuan, Xiang Chen, Tong Yu, Dachuan Shi, Can Jin, Wenke Lee, Saayan Mitra
TL;DR
The paper tackles catastrophic forgetting and data inefficiency when combining supervised fine-tuning (SFT) with reinforcement learning (RL) for reasoning in large language models. It introduces MIFO, a plug-and-play framework with two pillars: (1) data processing, which interleaves SFT with RL by curating challenging examples and restricting the SFT loss to high-entropy tokens, and (2) parameter freezing, which protects RL-critical parameters during SFT to reduce forgetting. Empirically, MIFO achieves state-of-the-art reasoning performance on math benchmarks for both 1.5B and 7B models while using only 1.5% of the SFT data and 20.4% of the RL data used by the prior SoTA, and it generates more concise reasoning traces. The approach is algorithm-agnostic, data-efficient, and robust to template shifts, which makes it practical for scalable reasoning post-training. Theoretical insights also suggest why SFT updates tend to be more redundant than RL updates, which informs the forgetting-mitigation strategy.
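To make the idea of focusing the SFT loss on high-entropy tokens concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the top-fraction selection rule, and the `keep_ratio` value are all assumptions chosen for illustration, since the summary only states that high-entropy tokens are selected for loss calculation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_mask(entropies, keep_ratio):
    """Assumed rule: keep the top `keep_ratio` fraction of tokens by entropy."""
    k = max(1, math.ceil(keep_ratio * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]

def masked_sft_loss(logits_per_token, target_ids, keep_ratio=0.2):
    """Mean negative log-likelihood over the selected high-entropy tokens only."""
    probs = [softmax(row) for row in logits_per_token]
    entropies = [token_entropy(p) for p in probs]
    mask = high_entropy_mask(entropies, keep_ratio)
    nll = [-math.log(p[t]) for p, t in zip(probs, target_ids)]
    selected = [loss for loss, keep in zip(nll, mask) if keep]
    return sum(selected) / len(selected), mask

# Three tokens over a 3-word vocabulary: confident, maximally uncertain, in between.
logits = [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
targets = [0, 1, 0]
loss, mask = masked_sft_loss(logits, targets, keep_ratio=0.3)
```

In this toy input only the second token, whose predictive distribution is uniform, contributes to the loss; the confident tokens are skipped, so SFT gradients would concentrate on positions where the model is uncertain.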
Abstract
Large Language Models (LLMs) show strong reasoning abilities, often amplified by Chain-of-Thought (CoT) prompting and reinforcement learning (RL). Although RL algorithms can substantially improve reasoning, they struggle to expand reasoning boundaries because they learn from their own reasoning trajectories rather than acquiring external knowledge. Supervised fine-tuning (SFT) offers complementary benefits but typically requires large-scale data and risks overfitting. Recent attempts to combine SFT and RL face three main challenges: data inefficiency, algorithm-specific designs, and catastrophic forgetting. We propose a plug-and-play framework that dynamically integrates SFT into RL by selecting challenging examples for SFT. This approach reduces SFT data requirements and remains agnostic to the choice of RL or SFT algorithm. To mitigate catastrophic forgetting of RL-acquired skills during SFT, we select high-entropy tokens for loss calculation and freeze parameters identified as critical for RL. Our method achieves state-of-the-art (SoTA) reasoning performance using only 1.5% of the SFT data and 20.4% of the RL data used by prior SoTA, providing an efficient and plug-and-play solution for combining SFT and RL in reasoning post-training.
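The parameter-freezing idea can be illustrated with a small sketch. The abstract does not say how RL-critical parameters are identified, so the criterion below (freeze the entries that moved most during the preceding RL phase) is purely an assumption for illustration, as are the names `rl_critical_mask`, `sft_step`, and `freeze_ratio`.

```python
def rl_critical_mask(before_rl, after_rl, freeze_ratio):
    """Assumed criterion: flag the entries with the largest RL update magnitude."""
    deltas = [abs(a - b) for a, b in zip(after_rl, before_rl)]
    k = int(freeze_ratio * len(deltas))
    ranked = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    frozen = set(ranked[:k])
    return [i in frozen for i in range(len(deltas))]

def sft_step(params, grads, frozen, lr):
    """One gradient-descent step on the SFT loss that skips frozen entries."""
    return [p if f else p - lr * g for p, g, f in zip(params, grads, frozen)]

# Toy flat parameter vector before and after an RL phase.
before = [0.0, 0.0, 0.0, 0.0]
after = [0.5, 0.01, -0.4, 0.02]
frozen = rl_critical_mask(before, after, freeze_ratio=0.5)

# During the following SFT phase, only the non-frozen entries are updated.
sft_grads = [1.0, 1.0, 1.0, 1.0]
updated = sft_step(after, sft_grads, frozen, lr=0.1)
```

Entries 0 and 2 changed most under RL, so they keep their RL values through the SFT step while the other two entries are free to move. In a real training stack the same effect would typically be obtained by masking gradients or excluding parameters from the optimizer.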
