SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs

Jiacheng Lin, Zhongruo Wang, Kun Qian, Tian Wang, Arvind Srinivasan, Hansi Zeng, Ruochen Jiao, Xie Zhou, Jiri Gesi, Dakuo Wang, Yufan Guo, Kai Zhong, Weiqi Zhang, Sujay Sanghavi, Changyou Chen, Hyokun Yun, Lihong Li

TL;DR

The paper challenges the prevailing view that domain-specific SFT inevitably harms general capabilities by showing that smaller learning rates can preserve general performance while maintaining domain gains. It provides an information-theoretic analysis of fine-tuning dynamics and introduces Token-Adaptive Loss Reweighting (TALR) to further mitigate degradation, with a closed-form token-weighting strategy that adapts during training. Empirical results on MedCalc, ESCI, and MetaMathQA demonstrate that TALR often yields superior trade-offs compared with existing baselines, especially when larger learning rates are necessary. The work delivers practical guidelines for domain adaptation and highlights directions for future research to further stabilize and improve cross-domain generalization.

Abstract

Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach to adapt Large Language Models (LLMs) to specialized tasks but is often believed to degrade their general capabilities. In this work, we revisit this trade-off and present both empirical and theoretical insights. First, we show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation while preserving comparable target-domain performance. We then provide a theoretical analysis that explains these phenomena and further motivates a new method, Token-Adaptive Loss Reweighting (TALR). Building on this, and recognizing that smaller learning rates alone do not fully eliminate general-performance degradation in all cases, we evaluate a range of strategies for reducing general capability loss, including L2 regularization, LoRA, model averaging, FLOW, and our proposed TALR. Experimental results demonstrate that while no method completely eliminates the trade-off, TALR consistently outperforms these baselines in balancing domain-specific gains and general capabilities. Finally, we distill our findings into practical guidelines for adapting LLMs to new domains: (i) use a small learning rate to achieve a favorable trade-off, and (ii) when a stronger balance is desired, adopt TALR as an effective strategy.
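
To make the idea of token-level loss reweighting concrete, here is a minimal sketch. This summary does not state the TALR formula, so the scheme below is an assumption chosen for illustration, not the paper's method: each token weight is taken proportional to `exp(-loss / tau)`, which is the closed-form minimizer of the weighted loss plus an entropy penalty `tau * sum(w * log w)` over the probability simplex. The function names and the temperature `tau` are hypothetical.

```python
import math


def token_weights(token_probs, tau=1.0):
    """Hypothetical weighting: w_i proportional to exp(-loss_i / tau).

    With loss_i = -log p_i this equals p_i ** (1 / tau), so tokens the
    model already finds very unlikely receive smaller weights.
    Assumes every probability is strictly positive.
    """
    raw = [p ** (1.0 / tau) for p in token_probs]
    total = sum(raw)
    return [r / total for r in raw]


def reweighted_nll(token_probs, tau=1.0):
    """Weighted negative log-likelihood over one sequence.

    In a real training loop the weights would be treated as constants
    (no gradient flows through them); that detail is an assumption here.
    """
    weights = token_weights(token_probs, tau)
    return sum(w * -math.log(p) for w, p in zip(weights, token_probs))


if __name__ == "__main__":
    probs = [0.9, 0.9, 0.01]  # two easy tokens, one hard token
    print(token_weights(probs, tau=1.0))
    print(reweighted_nll(probs, tau=1.0))
```

Because the weights are recomputed from the current token probabilities, they change as the model learns, which is one way a weighting can "adapt during training". A large `tau` pushes the weights toward uniform and recovers the ordinary mean loss.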

Paper Structure

This paper contains 46 sections, 18 theorems, 133 equations, 9 figures, 8 tables, 1 algorithm.

Key Result

Proposition 3.1

Consider two model distributions $q_{\theta_1}(\cdot)$ and $q_{\theta_2}(\cdot)$ over the token tree $\mathcal{T}_\mathcal{D}$ with distribution $P$. The change in expected code length on $P$ when shifting from $q_{\theta_1}$ to $q_{\theta_2}$ is $\Delta L(P) = \mathbb{E}_{z\sim P}[L_{q_{\theta_2}}(

Figures (9)

  • Figure 1: Effect of learning rate on domain-specific and general capability performance during supervised fine-tuning (SFT). We conduct experiments on two domain-specific datasets, MedCalc and ESCI. For the ESCI (w/o CoT) variant, the model is trained only to predict the final label without intermediate reasoning steps, unlike the other three settings where reasoning traces are available. General capability performance is measured as the average across IFEval, GSM8K, and HumanEval unless otherwise specified. We observe that smaller learning rates yield a more favorable trade-off (upper-right corner) between domain performance and general performance.
  • Figure 2: Token-level analysis on the MedCalc dataset. (a) Heatmap of token probabilities from Qwen-2.5-3B-Instruct for an example. Darker cells indicate higher model confidence; harder tokens with low probability often correspond to domain-specific concepts. (b) Distribution of token probabilities across the full SFT training set for multiple models. Most tokens are confidently predicted (medians near 1.0), suggesting low learning difficulty. (c) Fraction of tokens with $p>0.2$ increases from epoch 1 to epoch 2 when training updates use tokens with $p>0.2$, showing a clear curriculum phenomenon.
  • Figure 3: Effect of learning rate on domain-specific and general capability performance during supervised fine-tuning (SFT). Results are shown for (a) Qwen3-8B on ESCI with CoT supervision, (b) Qwen3-8B on ESCI without CoT, and (c) DeepSeek-Coder-7B on MetaMathQA. Across all settings, smaller learning rates achieve more favorable trade-offs.
  • Figure 4: Effect of KL regularization on domain-specific SFT. We follow DeepSeek-R1 (Guo et al., 2025) and apply the $k3$ approximation for KL regularization. Results are shown for three learning rates: (a) $1 \times 10^{-6}$, (b) $5 \times 10^{-6}$, and (c) $2 \times 10^{-5}$. Across all settings, KL regularization yields performance that is very close to standard SFT, suggesting limited additional benefit in mitigating general-performance degradation.
  • Figure 5: Effect of learning rate on the trade-off between domain performance and general multi-choice commonsense and knowledge QA performance. Domain performance is measured on MedCalc, while general performance is evaluated as the average accuracy across MMLU, ARC-Easy, ARC-Challenge, PIQA, and HellaSwag. Results are shown for (a) Qwen3-8B, (b) Qwen2.5-7B, (c) Qwen3-4B, and (d) Qwen2.5-3B.
  • ...and 4 more figures
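
The $k3$ approximation mentioned in the Figure 4 caption can be sketched as follows. This is a minimal illustration of the commonly used estimator $k3 = r - 1 - \log r$ with $r = \pi_{\text{ref}}(x) / \pi_\theta(x)$, which estimates $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ when tokens are sampled from $\pi_\theta$. How the paper applies it to SFT tokens, and how the penalty is weighted, are not given in this summary; the function name and inputs are hypothetical.

```python
import math


def k3_kl_estimate(logp_policy, logp_ref):
    """Mean per-token k3 estimate from two lists of token log-probabilities.

    logp_policy: log-probabilities under the model being fine-tuned.
    logp_ref:    log-probabilities of the same tokens under the frozen
                 reference model.
    Each term exp(d) - 1 - d with d = logp_ref - logp_policy is
    non-negative and is zero exactly when the two models agree.
    """
    terms = []
    for lp, lr in zip(logp_policy, logp_ref):
        log_ratio = lr - lp  # log(pi_ref / pi_theta)
        terms.append(math.exp(log_ratio) - 1.0 - log_ratio)
    return sum(terms) / len(terms)


if __name__ == "__main__":
    policy = [math.log(0.5), math.log(0.8)]
    ref = [math.log(0.25), math.log(0.8)]
    print(k3_kl_estimate(policy, ref))
```

In a training loop this quantity would typically be scaled by a coefficient and added to the SFT loss; that coefficient is an assumption and is not specified here.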

Theorems & Definitions (32)

  • Definition 3.1: Token Tree $\mathcal{T}$
  • Definition 3.2: LLM Compression Protocol
  • Proposition 3.1: Expected Code Length Discrepancy under Model Shift
  • Theorem 3.1
  • Theorem 3.2
  • Definition B.1: Token Tree $\mathcal{T}$
  • Definition B.2: LLM Compression Protocol
  • Proposition B.1: Expected Code Length
  • Proposition B.1: Expected Code Length
  • Proposition B.2: Joint Token Tree for Multiple Datasets
  • ...and 22 more