mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT

Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun

Abstract

Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest overfitting sub-dataset, and reverts to that sub-dataset's optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms that mSFT maintains robust gains across diverse dataset sizes and task granularities, and is insensitive to its single new hyperparameter (compute budget). Notably, at a low compute budget, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.
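
The train / exclude / revert loop described in the abstract can be illustrated with a toy simulation. This is only a sketch under strong assumptions, not the authors' implementation: the names `msft_sketch`, `val_loss`, `budget`, and `max_rounds` are hypothetical, a "checkpoint" is reduced to a per-task step counter, and each task's validation loss is an invented quadratic with a fixed optimum. In particular, the toy ignores the effect noted in Figure 3, where excluding a sub-dataset shifts the optimal stopping points of the remaining tasks.

```python
from typing import Callable, Dict, List


def msft_sketch(
    tasks: List[str],
    val_loss: Callable[[str, int], float],
    budget: int,
    max_rounds: int = 100,
) -> Dict[str, int]:
    """Toy loop: train, exclude the earliest overfitter, revert, repeat.

    Returns how many steps each task had been trained for when it was excluded.
    All names and the loss model are illustrative assumptions.
    """
    active = set(tasks)
    steps = {t: 0 for t in tasks}  # stands in for a model checkpoint
    for _ in range(max_rounds):
        if not active:
            break
        # "Train" on the active mixture for `budget` steps and log each task's
        # validation loss after every step (offset 0 is the current checkpoint).
        history = {
            t: [val_loss(t, steps[t] + k) for k in range(budget + 1)]
            for t in active
        }
        # Offset of the first minimum of each task's validation curve.
        best = {t: h.index(min(h)) for t, h in history.items()}
        # A task counts as overfitting if its loss bottomed out before the
        # budget ran out.
        overfit = {t: k for t, k in best.items() if k < budget}
        if not overfit:
            # Nothing overfit in this round: keep the whole run and go on.
            for t in active:
                steps[t] += budget
            continue
        # Earliest-overfitting task; ties are broken by name for determinism.
        earliest = min(overfit, key=lambda t: (overfit[t], t))
        # Revert to the checkpoint at that task's best offset, then exclude it.
        for t in active:
            steps[t] += overfit[earliest]
        active.remove(earliest)
    return steps


# Hypothetical per-task optima (in steps) for the toy quadratic loss.
optima = {"math": 3, "code": 7, "chat": 12}
stops = msft_sketch(
    list(optima),
    lambda t, s: float((s - optima[t]) ** 2),
    budget=5,
)
print(stops)  # {'math': 3, 'code': 7, 'chat': 12}
```

In this toy setting every task ends up stopped at its own optimum, whereas a homogeneous schedule would have to pick a single step count for all three.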

Paper Structure (59 sections, 16 equations, 21 figures, 7 tables)

Figures (21)

  • Figure 1: Status quo. Frontier open-weight models continue to employ homogeneous SFT, where all sub-datasets are trained on the same amount of compute.
  • Figure 2: Heterogeneous learning dynamics. Multi-task SFT on Qwen3 8B demonstrates that the overfitting dynamics of the underlying sub-datasets vary greatly. This observation is consistent across all other models, as visualized in Appendix \ref{app:motivation_app}.
  • Figure 3: Divergence of optimal compute upon dataset exclusion. Excluding a small fraction of the training mixture alters the optimization trajectory, shifting the optimal stopping points of the remaining tasks. (a) $\Delta$ optimal compute varies across individual sub-tasks. (b) This divergence is consistent across model families and scales, averaging an absolute shift of 0.91 epochs. A detailed decomposition across other models is available in Appendix \ref{app:results_delta}.
  • Figure 4: Further details of main results. [left] mSFT achieves the lowest standard deviation (STD) across benchmarks, indicating that its performance gains are not due to large outliers. [right] Across models, mSFT takes 1st place most often. The 1st-place counts do not add up to 60 = 6 $\cdot$ 10 (models $\cdot$ benchmarks) because some 1st places are tied.
  • Figure 5: Robustness across varying dataset sizes. $\Delta$ Accuracy of Continual SFT, IES, and mSFT relative to SFT. mSFT consistently achieves the highest performance gains across different total dataset sizes and tasks ($N$), avoiding the degradation seen in Continual SFT at larger scales.
  • ...and 16 more figures