Understanding the Performance and Estimating the Cost of LLM Fine-Tuning

Yuchen Xia, Jiho Kim, Yuhan Chen, Haojie Ye, Souvik Kundu, Cong Hao, Nishil Talati

TL;DR

The paper addresses the cost and performance of fine-tuning large language models by profiling sparse MoE approaches on a single GPU, comparing Mixtral and BlackMamba across domain datasets. It demonstrates that MoE layers dominate runtime and that sparse MoE can match dense accuracy while enabling larger batch sizes and higher throughput. A validated analytical model links model size, dataset size, and GPU architecture to cloud-based fine-tuning cost, providing practical budgeting guidance (e.g., ~$3460 for an H100-based 2M-query Mixtral fine-tune). This work offers actionable insights for practitioners and a generalizable framework for estimating and optimizing the cost of LLM fine-tuning on cloud platforms.

Abstract

Due to the cost-prohibitive nature of training Large Language Models (LLMs), fine-tuning has emerged as an attractive alternative for specializing LLMs for specific tasks using limited compute resources in a cost-effective manner. In this paper, we characterize sparse Mixture of Experts (MoE) based LLM fine-tuning to understand their accuracy and runtime performance on a single GPU. Our evaluation provides unique insights into the training efficacy of sparse and dense versions of MoE models, as well as their runtime characteristics, including maximum batch size, execution time breakdown, end-to-end throughput, GPU hardware utilization, and load distribution. Our study identifies the optimization of the MoE layer as crucial for further improving the performance of LLM fine-tuning. Using our profiling results, we also develop and validate an analytical model to estimate the cost of LLM fine-tuning on the cloud. This model, based on parameters of the model and GPU architecture, estimates LLM throughput and the cost of training, aiding practitioners in industry and academia to budget the cost of fine-tuning a specific model.
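
To make the budgeting idea concrete, the following is a minimal sketch of how a throughput figure can be turned into a cloud cost. It is not the paper's analytical model, which also estimates throughput from model and GPU parameters; here throughput is simply taken as an input. The function name `estimate_finetune_cost`, its parameters, and the example throughput and hourly price are all hypothetical placeholders, not values from the paper.

```python
def estimate_finetune_cost(num_queries, epochs, throughput_qps, gpu_price_per_hour):
    """Return (gpu_hours, dollars) for a single-GPU fine-tuning run.

    Assumes throughput is constant over the run and billing is
    proportional to wall-clock time. All inputs are illustrative.
    """
    if throughput_qps <= 0:
        raise ValueError("throughput_qps must be positive")
    total_queries = num_queries * epochs            # queries processed overall
    gpu_hours = total_queries / throughput_qps / 3600.0
    return gpu_hours, gpu_hours * gpu_price_per_hour


if __name__ == "__main__":
    # Hypothetical inputs: 2M queries, 1 epoch, 2 queries/s, $3 per GPU-hour.
    hours, dollars = estimate_finetune_cost(2_000_000, 1, 2.0, 3.0)
    print(f"{hours:.1f} GPU-hours, ${dollars:.2f}")
```

Because cost scales inversely with throughput in this sketch, any optimization that raises queries per second (for example, a larger feasible batch size) lowers the estimated bill proportionally.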

Paper Structure (36 sections, 2 equations, 15 figures, 4 tables)

Figures (15)

  • Figure 1: LLM model overview. We evaluate accuracy, throughput, runtime, and GPU characterization for different models, input datasets, and fine-tuning sparsity. The differently colored expert boxes in the MoE layer indicate that different sets of experts are activated depending on the input token.
  • Figure 2: Sequence length distribution for evaluated datasets.
  • Figure 3: Testing accuracy of Mixtral and BlackMamba. Both models are evaluated on two datasets, HellaSwag (HE) and GSM8K (GS), using dense and sparse fine-tuning.
  • Figure 4: Execution time breakdown.
  • Figure 5: Execution time breakdown in terms of different model layers.
  • ...and 10 more figures