Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs

Aldo Pareja, Nikhil Shivakumar Nayak, Hao Wang, Krishnateja Killamsetty, Shivchander Sudalairaj, Wenlong Zhao, Seungwook Han, Abhishek Bhandwaldar, Guangxuan Xu, Kai Xu, Ligong Han, Luke Inglis, Akash Srivastava

TL;DR

This study addresses the resource gap in fine-tuning small LLMs with a comprehensive empirical analysis of supervised fine-tuning for 3B to 7B parameter models on instruction-tuning datasets spanning knowledge domains and skills. It compares stacked and phased training, explores hyperparameters such as batch size, learning rate, warmup steps, and learning rate schedule, and challenges common practices, including the hyperparameter recommendations from TULU and the phased training recommended by Orca. Key findings: larger batch sizes paired with lower learning rates improve performance on benchmarks such as MMLU, MTBench, and the Open LLM Leaderboard; early training dynamics (lower gradient norms and higher loss values) indicate better final performance, so sub-optimal runs can be terminated early to save compute; and stacked training performs on par with phased training while being simpler and more sample efficient. The findings hold across datasets and models and give practical guidelines for practitioners with limited compute.

Abstract

The rise of large language models (LLMs) has created a significant disparity: industrial research labs, with their computational resources, expert teams, and advanced infrastructure, can effectively fine-tune LLMs, while individual developers and small organizations face barriers due to limited resources. In this paper, we aim to bridge this gap by presenting a comprehensive study on supervised fine-tuning of LLMs using instruction-tuning datasets spanning diverse knowledge domains and skills. We focus on small LLMs (3B to 7B parameters) for their cost-efficiency and accessibility. We explore various training configurations and strategies across four open-source pre-trained models. We provide detailed documentation of these configurations, revealing findings that challenge several common training practices, including hyperparameter recommendations from TULU and phased training recommended by Orca. Key insights from our work include: (i) larger batch sizes paired with lower learning rates lead to improved model performance on benchmarks such as MMLU, MTBench, and Open LLM Leaderboard; (ii) early-stage training dynamics, such as lower gradient norms and higher loss values, are strong indicators of better final model performance, enabling early termination of sub-optimal runs and significant computational savings; (iii) through a thorough exploration of hyperparameters like warmup steps and learning rate schedules, we provide guidance for practitioners and find that certain simplifications do not compromise performance; and (iv) we observe no significant difference in performance between phased and stacked training strategies, but stacked training is simpler and more sample efficient. With these findings holding robustly across datasets and models, we hope this study serves as a guide for practitioners fine-tuning small LLMs and promotes a more inclusive environment for LLM research.

Paper Structure

This paper contains 34 sections, 1 equation, 19 figures, 13 tables.

Figures (19)

  • Figure 1: Correlation between early training dynamics and final performance on MMLU and MTBench benchmarks for TULU vs. LAB Phase 10 training.
  • Figure 2: LAB Learning Rate (LR) Sweep: Training Dynamics and MTBench Performance. MMLU results are provided in the appendix on training dynamics.
  • Figure 3: Comparison of stacked and phased training strategies on MTBench using LAB hyperparameters.
  • Figure 4: Final MMLU Performance comparison using LAB hyperparameters: stacked vs. phased training.
  • Figure 5: MMLU Sample efficiency comparison between stacked and phased training.
  • ...and 14 more figures