On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang

TL;DR

The paper analyzes why Supervised Fine-Tuning (SFT) lags behind reinforcement learning in generalization for LLMs and identifies a sparse, inverse-probability reward in the SFT gradient. It introduces Dynamic Fine-Tuning (DFT), a single-line modification that dynamically reweights the SFT loss by token probability, yielding stable, uniformly weighted updates. Empirically, DFT achieves substantial gains across mathematical reasoning benchmarks, code generation, and multi-modal tasks, and shows competitive performance in offline RL settings without extra reward models. The work combines theoretical insight with practical validation, offering a simple yet effective route to close the gap between SFT and RL.

Abstract

We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes the gradient update for each token by dynamically rescaling the objective function with the probability of that token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
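The rescaling described in the abstract can be illustrated with a small, framework-free sketch. It is not taken from the authors' repository: the function names (`softmax`, `sft_token_loss`, `dft_token_loss`, `sft_grad`, `dft_grad`) are hypothetical, and the sketch assumes that DFT multiplies each token's negative log-likelihood by that token's probability, with the probability treated as a constant when differentiating. Under that assumption, the gradient with respect to the logits is the usual SFT gradient scaled by the target-token probability.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_token_loss(logits, target):
    """Standard SFT loss for one token: negative log-likelihood of the target."""
    return -math.log(softmax(logits)[target])


def dft_token_loss(logits, target):
    """Assumed DFT loss value: the NLL rescaled by the target token's probability."""
    p = softmax(logits)[target]
    return -p * math.log(p)


def sft_grad(logits, target):
    """Gradient of the SFT loss w.r.t. the logits: softmax minus one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def dft_grad(logits, target):
    """Gradient of the assumed DFT loss w.r.t. the logits.

    The probability weight is treated as a constant (stop-gradient),
    so the result is the SFT gradient scaled by the target probability.
    """
    p_target = softmax(logits)[target]
    return [p_target * g for g in sft_grad(logits, target)]


if __name__ == "__main__":
    # Compare a low-probability target with a high-probability one.
    for logits in ([0.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0]):
        p = softmax(logits)[0]
        print(f"p(target)={p:.3f}",
              f"sft_grad={sft_grad(logits, 0)[0]:.4f}",
              f"dft_grad={dft_grad(logits, 0)[0]:.4f}")
```

In an automatic-differentiation framework, the same effect would presumably come from multiplying the per-token log-likelihood by a detached copy of the token probability, which would be consistent with the "single-line" description; the exact line in the authors' code is not shown here.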

Paper Structure

This paper contains 36 sections, 20 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Accuracy progression for Qwen2.5-Math-1.5B across mathematical benchmarks, illustrating faster convergence and better performance achieved by DFT relative to SFT.
  • Figure 2: Token probability distributions on the training set before training and after fine-tuning with DFT, SFT, and various RL methods. A logarithmic scale is used on the y-axis for clarity.
  • Figure 3: Ablation study of training hyper-parameters (learning rate and batch size) for DFT and SFT on the Qwen2.5-Math-1.5B model.