Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue

TL;DR

Do gains in mathematical reasoning transfer to general LLM capabilities? The study compares reinforcement-learning-tuned and supervised-fine-tuned models across math, scientific QA, coding, planning, and instruction-following tasks, and introduces the Transferability Index to quantify cross-domain transfer. It finds that RL-tuned models generalize across domains, while SFT-tuned models often degrade on non-math tasks because of representational and token-distribution drift. Probing analyses (PCA on latent states and KL divergence on token distributions) indicate that RL preserves latent structure and concentrates updates on task-relevant tokens, whereas SFT causes broad drift and forgetting; UniReason shows the strongest balance of math gains and general-domain retention. The results argue for revising post-training recipes to emphasize on-policy RL and careful data and objective design in order to achieve robust cross-domain reasoning.

Abstract

Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.

Paper Structure

This paper contains 36 sections, 6 equations, 20 figures, 11 tables.

Figures (20)

  • Figure 1: Impact of SFT and RL using math-only training queries on the same base model, Qwen3-14B-Base. Performance improvements are measured relative to the base model. SFT-trained models show limited transfer to non-reasoning tasks. In contrast, RL-trained models exhibit broader generalization across both reasoning and non-reasoning scenarios.
  • Figure 2: Transferability of mathematical reasoning to other reasoning and non-reasoning tasks. The Transferability Index measures a model's ability to transfer performance from mathematics to other domains, with positive values indicating successful transfer and negative values indicating performance degradation. Details of this metric can be found in the paper's section on this phenomenon. RL models consistently outperform SFT models, regardless of model size, architecture, or training data, demonstrating superior transferability.
  • Figure 3: PCA shift of Qwen3-14B-Base across different training methods and tasks. $d^{(*)}$ is the Euclidean distance between representation centroids before and after training. The first two rows show models trained with SFT, and the last row shows models trained with RL. RL training results in the smallest PCA shift for all task types, suggesting more stable latent representations.
  • Figure 4: KL divergence analysis of RL and SFT models. Higher KL divergence indicates greater distribution shifts from the original backbone model. We observe that RL models consistently exhibit significantly lower KL divergence compared to SFT models across different tasks, suggesting less distribution shift during training.
  • Figure 5: Word clouds showing significantly shifted tokens for UniReason-Qwen3-14B-RL (left) and UniReason-Qwen3-14B-SFT-think (right). Tokens are extracted based on their frequency and rank shifts relative to the base model, then categorized as logical-structural words (in red) or content-specific words (in blue). The RL model shifts mainly logic-related tokens such as "But" and "So", while the SFT model shifts a much wider set of tokens, many of them irrelevant to the task.
  • ...and 15 more figures
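
The two drift measures named in the captions for Figures 3 and 4 can be illustrated with a small sketch. This is not the authors' code: the function names, the choice to fit PCA on the pooled base and tuned hidden states, and the KL direction (tuned model relative to base model) are all assumptions made for illustration, and the inputs are synthetic arrays rather than real model activations or logits.

```python
import numpy as np


def pca_centroid_shift(base_hidden, tuned_hidden, n_components=2):
    """Euclidean distance between the centroids of two sets of hidden states
    after projecting both into a shared PCA space.

    Assumption: PCA is fitted on the pooled (base + tuned) states; the paper
    may fit it differently.
    """
    stacked = np.vstack([base_hidden, tuned_hidden])
    mean = stacked.mean(axis=0)
    # Rows of vt are the principal directions of the centered, pooled data.
    _, _, vt = np.linalg.svd(stacked - mean, full_matrices=False)
    components = vt[:n_components]
    base_proj = (base_hidden - mean) @ components.T
    tuned_proj = (tuned_hidden - mean) @ components.T
    return float(np.linalg.norm(tuned_proj.mean(axis=0) - base_proj.mean(axis=0)))


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mean_token_kl(tuned_logits, base_logits):
    """Average over positions of KL(tuned || base) on next-token distributions.

    Assumption: the divergence is taken from the tuned model to the base model.
    """
    log_p = log_softmax(tuned_logits)
    log_q = log_softmax(base_logits)
    kl_per_position = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_position.mean())


# Synthetic stand-ins: 50 hidden states of width 8, and logits for
# 10 positions over a 20-token vocabulary.
rng = np.random.default_rng(0)
base_states = rng.normal(size=(50, 8))
tuned_states = base_states + 0.5          # hypothetical uniform drift
base_logits = rng.normal(size=(10, 20))
tuned_logits = base_logits + rng.normal(scale=0.1, size=(10, 20))

print("PCA centroid shift:", pca_centroid_shift(base_states, tuned_states))
print("Mean token KL:", mean_token_kl(tuned_logits, base_logits))
```

Under these assumptions, a model whose hidden states and output distributions stay close to the base model yields values near zero on both measures, which is the pattern the captions attribute to RL training.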