Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

Aojun Lu; Tao Feng; Hangjie Yuan; Wei Li; Yanan Sun

TL;DR

This work investigates why RL-based post-training generalizes better than SFT for Vision-Language Models (VLMs). It argues for a data-centric mechanism: RL implicitly focuses updates on medium-difficulty samples, while standard SFT overfits or degrades when hard samples dominate optimization. To exploit this insight, the authors introduce Difficulty-Curated SFT (DC-SFT), with variants that filter the training data by difficulty to emulate RL's implicit data distribution. DC-SFT improves OOD generalization over standard SFT and surpasses RL-based training, with better training stability and higher efficiency, across image classification, visual grounding, and reasoning tasks. The approach offers a practical, scalable path to robust generalization in VLMs. Code is available at https://github.com/byyx666/DC-SFT.
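The claim that RL concentrates updates on medium-difficulty samples can be illustrated with a small sketch. Figure 1(a) describes these samples as the ones that yield high reward variance. The code below assumes a binary (0/1) reward and a group-normalized advantage of the kind used in GRPO-style methods; neither assumption is stated in the text here, and the function names are hypothetical. Under those assumptions, a sample the model always solves or always fails produces zero advantage and therefore no update signal.

```python
import statistics


def reward_variance(pass_rate: float) -> float:
    """Variance of a Bernoulli (0/1) reward with success probability pass_rate."""
    return pass_rate * (1.0 - pass_rate)


def group_normalized_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Assumed advantage form: (r - mean) / (std + eps) within one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Too-hard sample: every rollout fails, so every advantage is zero.
hard = group_normalized_advantages([0, 0, 0, 0])
# Too-easy sample: every rollout succeeds, so every advantage is zero as well.
easy = group_normalized_advantages([1, 1, 1, 1])
# Medium sample: mixed outcomes give non-zero advantages of both signs.
medium = group_normalized_advantages([1, 0, 1, 0])
```

For a binary reward the variance is largest at a pass rate of 0.5 and vanishes at 0 and 1, which is one way to read the "implicit filtering" described above.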

Abstract

The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.
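As a rough illustration of the difficulty-based filtering that DC-SFT performs, the sketch below buckets samples by an estimated pass rate and keeps only the chosen buckets before SFT. It is not the authors' implementation: the abstract does not say how difficulty is measured, so the `pass_rate` field, the thresholds `easy_min` and `hard_max`, and the helper names are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    # Hypothetical difficulty proxy: fraction of base-model attempts that were correct.
    pass_rate: float


def difficulty_bucket(pass_rate: float, easy_min: float = 0.8, hard_max: float = 0.2) -> str:
    """Map a pass rate to 'easy', 'medium', or 'hard' (thresholds are illustrative)."""
    if pass_rate >= easy_min:
        return "easy"
    if pass_rate <= hard_max:
        return "hard"
    return "medium"


def curate(samples: list[Sample], keep: set[str]) -> list[Sample]:
    """Return only the samples whose difficulty bucket is in `keep`."""
    return [s for s in samples if difficulty_bucket(s.pass_rate) in keep]


data = [Sample("a", 0.0), Sample("b", 0.5), Sample("c", 1.0)]

# One possible variant: train on medium-difficulty samples only.
medium_only = curate(data, keep={"medium"})
# Another possible variant: drop only the hard samples.
no_hard = curate(data, keep={"easy", "medium"})
```

The two calls at the end correspond to two plausible filtering variants; which buckets the paper's variants actually keep is specified in the paper body, not in this abstract.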

Paper Structure (17 sections, 5 equations, 6 figures, 6 tables)

Introduction
Related Works
Preliminaries
Data-Centric Analysis of Generalization
The Data-Centric Hypothesis of Generalization
Experimental Setup
Performance Results
DC-SFT: Enhancing SFT's Generalization
Evaluation Settings
Performance Results
Training Stability Analysis
Efficiency Analysis
Scalability Analysis
Analysis and Discussion
Hard Data's Impact on SFT Generalization
...and 2 more sections

Figures (6)

Figure 1: (a) RL implicitly focuses updates on medium-difficulty samples that yield high reward variance. (b) ID and OOD performance after SFT on data subsets of varying difficulty levels.
Figure 2: (a) Illustrative examples of the data difficulty taxonomy. (b) Illustrative examples of generalization evaluation benchmarks for image classification (top) and visual grounding (bottom).
Figure 3: Performance curves of different post-training paradigms using Qwen2.5-VL-7B as the backbone.
Figure 4: Training time comparison of Qwen2.5-VL-7B on ImageNet and RefCOCO.
Figure 5: The impact of hard data ratio on OOD performance.
...and 1 more figure
