UFT: Unifying Supervised and Reinforcement Fine-Tuning

Mingyang Liu, Gabriele Farina, Asuman Ozdaglar

TL;DR

UFT is a novel post-training paradigm that unifies SFT and RFT into a single, integrated process and generally outperforms both, regardless of model size.

Abstract

Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorizing and thinking underlying existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model sizes. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.

Paper Structure

This paper contains 43 sections, 10 theorems, 67 equations, 11 figures, 8 tables, 2 algorithms.

Key Result

theorem 1

For any integers $H\geq 1, B\geq 2$, and any RFT algorithm, there exists a problem with height $H$ and branching factor $B$ that satisfies the following: to achieve a $50\%$ pass@1 success rate, the algorithm needs to explore at least nodes in $\mathcal{S}_H$. Moreover, when there are multiple nodes in $\mathcal{S}_H$ representing the correct solutions, e.g., $K \geq 1$, any algorithm needs to explore at least

Figures (11)

  • Figure 1: (Top left, top right, middle, bottom) Illustrations of SFT, RFT, SFT-RFT, and UFT, respectively. SFT-RFT refers to applying RFT after an initial SFT stage [guo2025deepseek-r1, zeng2025simplerl-zoo]. (Top center) Annotation usage of the different algorithms over training. Curves are slightly shifted for better visibility.
  • Figure 2: Accuracy of different algorithms when trained on Countdown [WikipediaCountdown], MATH(3,4,5) (levels 3-5 only) [hendrycksmath2021-math, zeng2025simplerl-zoo], and the Knights and Knaves logic puzzle (Logic) [xie2025logic-rl]. Accuracy is averaged over Qwen2.5 models of sizes 0.5B, 1.5B, and 3B [qwen2.5]. Base refers to the model without fine-tuning, and $R^3$ is the curriculum reinforcement learning baseline [xi2024training-rlhint-uniform]. The figure shows that UFT outperforms both SFT and RFT, while the relative performance of SFT and RFT varies with task complexity.
  • Figure 3: An illustration of the Countdown game, where the goal is to obtain 24 by applying basic arithmetic operations ($+,-,\times,\div$) to the numbers $(3, 5, 7, 13)$. The green path represents the correct solution.
  • Figure 4: (Left) An illustration of the UFT prompt. We adopt the prompting template from TinyZero [tinyzero], which is similar to that used in DeepSeek-R1 [guo2025deepseek-r1]. The hint consists of a slice of the full solution. During training, the question prompt and the hint are concatenated and fed to the model. (Right) The training curve of Qwen2.5-0.5B. Stage and UFT use zero hint length from step 300 onward.
  • Figure 5: An ablation study of different hint length schedulers. RFT (cosine) refers to reinforcement learning with our cosine annealing hint length scheduler proposed in this section.
  • ...and 6 more figures
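
The Countdown instance described in Figure 3 can be explored with a small brute-force search. The following is a minimal sketch and not the authors' code: the function names `solve` and `_search` are hypothetical, and it assumes that every number must be used exactly once and that intermediate results may be fractional. It repeatedly combines two remaining numbers with one of the four operations, so each recursion level corresponds to one level of the search tree.

```python
from fractions import Fraction
from itertools import combinations


def solve(numbers, target=24):
    """Return an expression string that reaches `target`, or None if none exists.

    Assumption: every number is used exactly once.
    """
    # Pair each exact value with the expression that produced it.
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    # Leaf of the search tree: a single value remains.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None

    # Branch: pick two remaining values and combine them with one operation.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        # Division is only defined for a non-zero divisor.
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))

        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


if __name__ == "__main__":
    print(solve([3, 5, 7, 13], 24))
```

One valid solution for $(3, 5, 7, 13)$ is $(13 \times 5 + 7) \div 3 = 24$; the search may return a different but equivalent expression. Exact `Fraction` arithmetic avoids floating-point error when a path divides before multiplying.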

Theorems & Definitions (19)

  • remark 1
  • definition 1: Sub-Optimality Gap
  • theorem 1: Lowerbound
  • theorem 2: Informal
  • proof
  • theorem 3: Formal
  • proof
  • proposition 1
  • lemma 1
  • lemma 2
  • ...and 9 more