Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models

Haitao Jiang, Wenbo Zhang, Jiarui Yao, Hengrui Cai, Sheng Wang, Rui Song

Abstract

Pre-trained Large Language Models (LLMs) exhibit broad capabilities, yet for specific tasks or domains, attaining higher accuracy and more reliable reasoning generally depends on post-training through Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). Although often treated as distinct methodologies, recent theoretical and empirical developments demonstrate that SFT and RL are closely connected. This study presents a comprehensive and unified perspective on LLM post-training with SFT and RL. We first provide an in-depth overview of both techniques, examining their objectives, algorithmic structures, and data requirements. We then systematically analyze their interplay, highlighting frameworks that integrate SFT and RL, hybrid training pipelines, and methods that leverage their complementary strengths. Drawing on a representative set of recent application studies from 2023 to 2025, we identify emerging trends, characterize the rapid shift toward hybrid post-training paradigms, and distill key takeaways that clarify when and why each method is most effective. By synthesizing theoretical insights, practical methodologies, and empirical evidence, this study establishes a coherent understanding of SFT and RL within a unified framework and outlines promising directions for future research in scalable, efficient, and generalizable LLM post-training.

Paper Structure

This paper contains 35 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: A taxonomy of large language model (LLM) post-training alignment methods via supervised fine-tuning (SFT) and reinforcement learning (RL). We organize prior work along three categories: (1) algorithm-centric versus data-centric approaches within SFT and RL, (2) comparative, unifying, and hybrid frameworks that integrate SFT and RL objectives, and (3) representative downstream application domains, including reasoning, mathematics, agentic behavior, and code-related tasks.
  • Figure 2: Objective comparison across training paradigms. For integration methods, "(SFT $\rightarrow$) RL" denotes modifying the RL objective based on the SFT objective, and vice versa.
  • Figure 3: Trends in task focus, training methodologies, and ground-truth data sources from 2023 to 2025. Results show rapid growth across all surveyed domains, with substantial increases in research volume and diversification of application areas; increasing convergence toward hybrid SFT–RL pipelines, supported by more mature training infrastructures, libraries, and preference datasets; and a continued shift from API-based labeling to data generated with increasingly capable open-weight models. Projections for 2025 and all reported proportions are derived from surveyed publications; see Appendix \ref{appendix:fig_dis} for further discussion.