Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead

Feiyang Kang, Michael Kuchnik, Karthik Padthe, Marin Vlastelica, Ruoxi Jia, Carole-Jean Wu, Newsha Ardalani

TL;DR

This work questions the common assumption that higher SFT scores reliably predict stronger post-RL performance in reasoning LLMs. Through extensive experiments across models of up to 12B parameters and multiple datasets, the authors show that SFT scores can be biased toward simpler or more homogeneous data and fail to forecast RL outcomes. They identify two predictors, generalization loss on held-out SFT validation examples and Pass@k at large k, that predict the RL outcome substantially better than post-SFT performance does, improving the $R^2$ coefficient and Spearman's rank correlation coefficient by up to 0.5 (2x). The findings can guide data selection and training strategies to de-risk the expensive RL phase. The evaluation tool will be open-sourced to support further research on predicting RL outcomes in post-training pipelines.

Abstract

In post-training for reasoning Large Language Models (LLMs), the current state of practice trains LLMs in two independent stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened to "RL" below). In this work, we challenge the assumption that high SFT scores translate to improved performance after RL. We provide extensive counter-examples where this is not true. We find that high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. In some cases, RL training on models with improved SFT performance can lead to a substantially worse outcome than RL on the base model without SFT. We study alternative metrics and find that generalization loss on held-out reasoning examples and Pass@large k performance provide strong proxies for the RL outcome. We trained hundreds of models of up to 12B parameters with SFT and RLVR via GRPO and ran extensive evaluations on 7 math benchmarks with up to 256 repetitions, spending >1M GPU hours. Experiments include models from Llama3, Mistral-Nemo, and Qwen3 and multiple state-of-the-art SFT/RL datasets. Compared to predicting directly from pre-RL performance, prediction based on generalization loss and Pass@large k achieves substantially higher precision, improving the $R^2$ coefficient and Spearman's rank correlation coefficient by up to 0.5 (2x). This provides strong utility for broad use cases. For example, in most experiments, we find that SFT training on unique examples for one epoch underperforms training on half as many examples for two epochs, both after SFT and after SFT-then-RL. With the same SFT budget, training only on short examples may lead to better SFT performance, yet it often leads to a worse outcome after RL than training on examples of varying lengths. The evaluation tool will be open-sourced.
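The abstract relies on Pass@k at large k, computed from up to 256 repeated generations per problem, but does not spell out the estimator. A minimal sketch follows, assuming the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not taken from the authors' evaluation tool.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem.

    n: number of sampled generations, c: number of correct ones,
    k: sampling budget being evaluated (1 <= k <= n).
    Assumption: this is the standard combinatorial estimator, not
    necessarily the exact procedure used in the paper.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average Pass@k over problems, each sampled n times."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Hypothetical counts of correct answers out of n=256 samples per problem.
    counts = [0, 3, 40, 256]
    for k in (1, 16, 64):
        print(k, round(mean_pass_at_k(counts, n=256, k=k), 4))
```

With `k=1` the estimate reduces to the fraction of correct samples, which is why Pass@1 and Pass@large k can rank models differently: a problem solved in only a few of 256 attempts contributes almost nothing at `k=1` but a large amount at `k=64`.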

Paper Structure

This paper contains 29 sections, 1 equation, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Mistral-NeMo-12B-Instruct after SFT-RL, with SFT examples from the AceReasoner1.1-SFT dataset and RLVR via GRPO on the DeepScaleR dataset. Pass@1 performance is averaged over 7 math benchmarks. When training on Random/Longest/Shortest SFT examples, the final performance after RL increases at a different rate than the SFT performance. The model with the best SFT performance is not the one with the best final performance after RL. Post-SFT and SFT+RL performance correlate, but optimizing post-SFT performance might not optimize the final performance after RL.
  • Figure 2: Both models underwent SFT-RL with SFT examples from the AceReasoner1.1-SFT dataset and RLVR via GRPO on the DeepScaleR dataset. Pass@1 performance is averaged over 7 math benchmarks. Even with identical post-training procedures, different models may respond very differently. With increasing SFT examples, Mistral's (left) post-SFT performance and final performance after RL both increase. For Qwen3 models (right), however, post-SFT performance appears uncorrelated with the final performance after RL, which stays the same despite the substantially improved SFT performance.
  • Figure 3: Llama3-8B-Instruct after SFT-RL, with SFT examples from the Llama-Nemotron-SFT dataset and RLVR via GRPO on the MATH dataset (train split). Pass@1 performance is averaged over 7 math benchmarks. Linear fit between post-SFT performance and the final outcome after RL. The two correlate with $R^2=0.43$, indicating that post-SFT performance explains only 43% of the variation in the final outcome after RL, so a large share remains unexplained.
  • Figure 4: Both models underwent SFT-RL with SFT examples from the AceReasoner1.1-SFT dataset and RLVR via GRPO on the DeepScaleR dataset. Pass@1 performance is averaged over 7 math benchmarks. When SFT is repeated for more epochs on the same data, Mistral's (left) post-SFT performance continues to improve for up to 4 epochs, whereas the final performance after RL saturates after 2 epochs. Qwen3's (right) final performance after RL degrades with SFT training, even though these models' post-SFT performance is substantially higher than that of the base model. Both cases show a clear divergence between post-SFT performance and final performance after RL. Here, optimizing post-SFT performance is suboptimal or ineffective for improving the final model.
  • Figure 5: Llama3-8B-Instruct after SFT-RL, with SFT examples from the Llama-Nemotron-SFT dataset and RLVR via GRPO on the MATH dataset (train split). Reported are Pass@1 performance averaged over 7 math benchmarks and generalization loss on the validation set of the SFT data. We identify generalization loss after SFT as a viable indicator of the model's RL potential. As training is repeated for more epochs and post-SFT performance improves, the generalization loss on validation examples rises and eventually flares up, indicating strong overfitting. This generalization loss correlates strongly with the further performance gain during the subsequent RL, allowing prediction of the final outcome after RL.
  • ...and 5 more figures
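
The captions and abstract judge predictors by the $R^2$ of a linear fit and by Spearman's rank correlation between a pre-RL metric and the final post-RL score. The sketch below shows how both could be computed for a handful of models using only the Python standard library. The helper names and the numbers in the usage section are hypothetical illustrations, not data or code from the paper.

```python
from statistics import mean


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def r_squared(x: list[float], y: list[float]) -> float:
    """R^2 of the least-squares line y ~ a + b*x (equals Pearson r squared)."""
    return pearson(x, y) ** 2


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up Pass@1 scores (in %) for four hypothetical SFT checkpoints.
    post_sft = [10.0, 20.0, 30.0, 40.0]
    post_rl = [30.0, 25.0, 45.0, 40.0]
    print("R^2:", round(r_squared(post_sft, post_rl), 3))
    print("Spearman:", round(spearman(post_sft, post_rl), 3))
```

$R^2$ measures how well a straight line predicts the post-RL score, while Spearman's coefficient only asks whether the predictor ranks the checkpoints in the right order, which is the property that matters when the goal is to pick the best SFT checkpoint before paying for RL.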