Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong

TL;DR

The paper tackles inefficiencies in reinforcement learning for LLM reasoning, notably the forgetting and exploration issues that arise in traditional two-stage SFT+RL pipelines. It introduces BRIDGE, a cooperative meta-learning framework that casts SFT as the upper-level objective and RL as the lower-level objective of a bilevel optimization problem, using an augmented base+LoRA model and a penalty-based relaxation to maximize the cooperative gain over RL alone. Empirical results on three LLMs across five mathematical reasoning benchmarks (MATH500, Minerva Math, OlympiadBench, AIME, AMC) show that BRIDGE consistently outperforms baselines in accuracy and training efficiency, with strong generalization to out-of-domain tasks. The approach also shows favorable cost-benefit trade-offs and remains robust to LoRA hyperparameters, underscoring the practical value of tightly integrating imitation and exploration for complex reasoning tasks.

Abstract

Reinforcement learning (RL) has proven effective in incentivizing the reasoning abilities of large language models (LLMs), but it faces severe efficiency challenges due to its trial-and-error nature. While the common practice employs supervised fine-tuning (SFT) as a warm-up stage for RL, this decoupled two-stage approach suffers from catastrophic forgetting: second-stage RL gradually loses SFT-acquired behaviors and inefficiently explores new patterns. This study introduces a novel method for learning reasoning models that employs bilevel optimization to facilitate better cooperation between these training paradigms. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL's optimization process. During training, the lower level performs RL updates while simultaneously receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain, i.e., the performance advantage of joint SFT-RL training over RL alone. Empirical evaluations on five reasoning benchmarks demonstrate that our method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency.

Paper Structure

This paper contains 30 sections, 12 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Training dynamics of mean reward and response length on Qwen2.5-3B.
  • Figure 2: Comparison of Training Methods.
  • Figure 3: Comparison of two training methods.