On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning

Zhaoyi Li, Xiangyu Xi, Zhengyu Chen, Wei Wang, Gangwei Jiang, Ranran Shen, Linqi Song, Ying Wei, Defu Lian

Abstract

Supervised Fine-Tuning (SFT) on long Chain-of-Thought (CoT) trajectories has become a pivotal phase in building large reasoning models. However, how CoT trajectories from different sources influence a model's generalization performance remains an open question. In this paper, we conduct a comparative study using two sources of verified CoT trajectories generated by two competing models, \texttt{DeepSeek-R1-0528} and \texttt{gpt-oss-120b}, with their problem sets controlled to be identical. Despite the comparable performance of the two source models, we uncover a striking paradox: lower training loss does not translate to better generalization. Models fine-tuned on \texttt{DeepSeek-R1-0528} data achieve remarkably lower training loss, yet generalize significantly worse on reasoning benchmarks than those trained on \texttt{gpt-oss-120b} data. To understand this paradox, we perform a multi-faceted analysis probing token-level SFT loss and step-level reasoning behaviors. Our analysis reveals a clear difference in reasoning patterns: \texttt{gpt-oss-120b} produces highly convergent, deductive trajectories, whereas \texttt{DeepSeek-R1-0528} favors a divergent, branch-heavy exploration pattern. Consequently, models trained on \texttt{DeepSeek-R1} data inherit inefficient exploration behaviors, often getting trapped in redundant exploratory branches that prevent them from reaching correct solutions. Building on this insight, we propose a simple yet effective remedy: filtering out frequently branching trajectories before SFT. Experiments show that training on the selected \texttt{DeepSeek-R1-0528} subsets surprisingly improves reasoning performance by up to 5.1% on AIME25 and 5.5% on BeyondAIME, and by 3.6% on average across five benchmarks.
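The proposed filtering remedy can be made concrete with a short sketch. Below is a minimal Python illustration that assumes branching steps are detected via surface lexical cues; the cue list and the branch-rate threshold are hypothetical stand-ins, not the paper's actual criterion, which builds on its step-level behavior annotations.

```python
# Minimal sketch of filtering out frequently branching CoT trajectories
# before SFT. BRANCH_CUES and max_rate are illustrative assumptions, not
# the paper's exact criterion (which uses step-level behavior labels).

BRANCH_CUES = ("alternatively", "wait,", "on second thought", "another approach")

def branch_rate(trajectory: str) -> float:
    """Fraction of reasoning steps that open a new exploratory branch."""
    steps = [s for s in trajectory.split("\n\n") if s.strip()]
    if not steps:
        return 0.0
    branching = sum(any(cue in step.lower() for cue in BRANCH_CUES) for step in steps)
    return branching / len(steps)

def filter_trajectories(trajectories: list[str], max_rate: float = 0.3) -> list[str]:
    """Keep trajectories whose branching rate stays at or below max_rate."""
    return [t for t in trajectories if branch_rate(t) <= max_rate]
```

The key design choice is scoring whole trajectories rather than individual steps: trajectories with occasional branching survive, while branch-heavy exploration is dropped wholesale, matching the paper's finding that redundant exploratory branches are what transfer poorly under SFT.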

Paper Structure

This paper contains 20 sections, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: (a)–(d): SFT training loss comparison of different models trained on long CoT trajectories from DeepSeek-R1 and gpt-oss-120b. (e) and (f): average test performance on five benchmarks with varying training steps and inference context lengths. Blue/red curves refer to experiments with gpt-oss-120b/DeepSeek-R1-generated data, respectively.
  • Figure 2: Token-level SFT loss analysis for the Qwen3-8B model. (a) shows the token-level loss distribution after SFT training, where blue/red bars represent experiments with DeepSeek-R1/gpt-oss-120b trajectories. (b, c) are word clouds of the most frequent tokens among the top $10\%$ of tokens by token-level loss for the DeepSeek-R1 and gpt-oss-120b data, respectively (a sketch of computing such per-token losses follows this list).
  • Figure 3: Reasoning behavior distributions (a, c) and transition matrices (b, d) for the reasoning trajectories used in training and those generated by the fine-tuned Qwen3-8B when solving AIME24 problems. Blue/red elements: experiments with DeepSeek-R1/gpt-oss-120b-generated data, respectively.
  • Figure 4: Performance change ratio ($(\text{Acc}_{\text{original}}-\text{Acc}_{\text{retrain}})/\text{Acc}_{\text{original}}$) on five benchmarks after randomly deleting 10% of the reasoning steps in each training trajectory. Blue/red bars represent experiments with DeepSeek-R1/gpt-oss-120b-generated data, respectively.
  • Figure 5: The prompt template for annotating reasoning steps with four behavior labels.
  • ...and 7 more figures
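The token-level loss analysis in Figure 2 rests on per-token cross-entropy under a causal LM. Below is a minimal sketch with PyTorch and Hugging Face transformers; the checkpoint name follows the paper's Qwen3-8B setting, but batching, masking of prompt tokens, and dtype handling are simplified assumptions.

```python
# Minimal sketch: per-token cross-entropy losses for a causal LM, in the
# spirit of Figure 2's token-level analysis. Prompt-token masking and
# other training-time details are omitted for brevity.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B").eval()

def token_losses(text: str) -> list[tuple[str, float]]:
    """Return (token, loss) pairs for every predicted position in `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so that position t predicts token t+1, as in causal-LM training.
    shift_logits = logits[:, :-1, :].float()
    shift_labels = ids[:, 1:]
    losses = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    tokens = tok.convert_ids_to_tokens(shift_labels[0].tolist())
    return list(zip(tokens, losses.tolist()))

# Sorting the pairs by loss and keeping the top 10% recovers the kind of
# high-loss token subset visualized in the word clouds of Figure 2 (b, c).
```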