Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality

Yuto Harada, Yusuke Yamauchi, Yusuke Oda, Yohei Oseki, Yusuke Miyao, Yu Takagi

TL;DR

The paper investigates how data, model architecture, and training choices shape the alignment quality of LLMs under supervised fine-tuning (SFT) through a large-scale, controlled study across 12 base models and 10 English-language datasets. It reveals that perplexity relative to the base model robustly predicts downstream gains, mid-layer weight updates best track performance, and embedding-space analyses show a shared instruction-following trajectory across models, with model architecture exerting a strong influence on representations. The work also demonstrates that LoRA trajectories closely mirror full-parameter fine-tuning and that cross-lingual transfer persists even when training data are English-only, offering practical guidance for efficient SFT. By releasing over 1,000 fine-tuned models and a rich benchmark suite, the study provides a valuable resource for understanding SFT dynamics and guiding future alignment research.

Abstract

Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of SFT remain poorly understood. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks, resulting in 1,000+ SFT models under controlled conditions. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Our findings reveal that some training-task synergies persist across all models while others vary substantially, emphasizing the importance of model-specific strategies. Moreover, we demonstrate that perplexity consistently predicts SFT effectiveness, often surpassing superficial similarity between the training data and the benchmark, and that mid-layer weight changes correlate most strongly with performance gains. We release these 1,000+ SFT models and benchmark results to accelerate further research. All resources are available at https://github.com/llm-jp/massive-sft.
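As a rough illustration of the perplexity signal mentioned in the abstract, the sketch below computes perplexity from per-token log-probabilities. It assumes that a base model has already scored the training data and returned natural-log token probabilities; the function names `perplexity` and `dataset_perplexity` are hypothetical, and the token-weighted aggregation is an assumption rather than the procedure confirmed by the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over tokens (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def dataset_perplexity(per_example_logprobs):
    """Token-weighted perplexity over a dataset (an assumed aggregation choice).

    per_example_logprobs: list of lists, one list of token log-probs per example.
    """
    flat = [lp for example in per_example_logprobs for lp in example]
    return perplexity(flat)


if __name__ == "__main__":
    # Hypothetical scores: every token was assigned probability 0.5 by the base model.
    scores = [[math.log(0.5)] * 3, [math.log(0.5)] * 5]
    print(dataset_perplexity(scores))
```

Under this definition, a lower value means the base model already finds the training data more predictable, which is the direction the abstract associates with stronger SFT gains.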

Paper Structure

This paper contains 29 sections, 2 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Overview of this study. We conduct SFT on numerous combinations of base models and training data. These models are evaluated on a variety of benchmark tasks to comprehensively examine the relationships among the base models, training data, and benchmark tasks.
  • Figure 2: (a) Average performance change across diverse benchmarks, relative to each baseline model, after SFT on each training dataset. Each column is min-max scaled to the range $[-1, 1]$. (b) Performance changes shown for each model individually. (c) Pairwise correlation matrix of performance changes across all SFT models, with the corresponding hierarchical-clustering dendrogram superimposed. (d) Cumulative explained variance ratio obtained by applying PCA to the concatenated results from (b).
  • Figure 3: (a) Pairwise correlations between evaluation tasks in terms of performance improvements across training datasets. (b) Same as (a), but for pairwise correlations between training datasets. (c) Model-to-model similarity for (a) (top) and (b) (bottom). (d) Comparison of the lower-triangle elements of the two similarity matrices in (c).
  • Figure 4: Analysis of training data properties that affect downstream performance. We compare perplexity (a) and token length (b) with the average performance change on benchmark tasks for the SFT models; lower perplexity is a strong predictor of higher performance.
  • Figure 5: Layer-wise weight changes and their correlations with performance improvements. (a) Blue line indicates correlation coefficients between the amount of weight change from the base model and the overall improvement in accuracy, plotted as a function of layer position (0 = input; 1 = output). Compared to early and late layers, the mid-layers (0.6, indicated by red arrow) exhibit the strongest correlation. Orange line indicates the amount of weight change from the base model. (b) Focusing on the mid-layer (0.6), examining the relationship between the amount of weight change and accuracy change for each model reveals a robust correlation across all models. (c) Correlations calculated across models between weight changes from the base model and those from models trained on specific data. Again, the mid-layers show the strongest model-to-model correlation. (d) Intrinsic dimensionality (ID) of training-data embeddings before (blue line) vs. after SFT (red line). The divergence emerges around layer-position = 0.6 (dashed line), suggesting that mid-layer updates expand the representational subspace.
  • ...and 5 more figures
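
The layer-wise analysis described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that "amount of weight change" is measured as the Euclidean norm of the difference between flattened base and fine-tuned weights, that every SFT run in the input has the same number of layers, and that the correlation is a Pearson coefficient. All function names are hypothetical.

```python
import math


def layer_position(layer_idx, num_layers):
    """Map a layer index to [0, 1], where 0 is the input side and 1 the output side."""
    if num_layers < 2:
        raise ValueError("need at least two layers")
    return layer_idx / (num_layers - 1)


def weight_change(base, tuned):
    """Euclidean norm of the difference between two flattened weight lists (assumed metric)."""
    if len(base) != len(tuned):
        raise ValueError("weight lists must have the same length")
    return math.sqrt(sum((t - b) ** 2 for b, t in zip(base, tuned)))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def layerwise_correlation(changes_per_run, accuracy_gains):
    """For each layer, correlate weight change with accuracy gain across SFT runs.

    changes_per_run[r][l] is the weight change of run r at layer l.
    Returns a list of (relative layer position, correlation) pairs.
    """
    num_layers = len(changes_per_run[0])
    return [
        (layer_position(l, num_layers),
         pearson([run[l] for run in changes_per_run], accuracy_gains))
        for l in range(num_layers)
    ]


if __name__ == "__main__":
    # Made-up numbers purely to show the call shape; not results from the paper.
    changes = [[1.0, 1.0, 5.0], [2.0, 3.0, 4.0], [3.0, 2.0, 3.0]]
    gains = [0.1, 0.2, 0.3]
    for position, corr in layerwise_correlation(changes, gains):
        print(f"position={position:.2f} correlation={corr:.2f}")
```

Using a relative position rather than a raw layer index is what allows models of different depths to be placed on a common axis, as in the caption's "0 = input; 1 = output" convention.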