Frenzy: A Memory-Aware Serverless LLM Training System for Heterogeneous GPU Clusters

Zihan Chang, Sheng Xiao, Shuibing He, Siling Yang, Zhe Pan, Dong Li

TL;DR

Frenzy tackles the burden of manual GPU selection for LLM training on heterogeneous clusters by introducing MARP, a memory predictor, and HAS, a heterogeneity-aware scheduler in a serverless framework. It demonstrates accurate memory forecasting (92–98%), low-overhead scheduling (≈10x faster), and 12–18% reductions in job completion time across real and simulated heterogeneous clusters. The approach enables seamless deployment and training without user-managed hardware specifics, improving utilization and efficiency. This work broadens serverless applicability to heterogeneous GPU ecosystems and offers a practical path for scalable LLM training.

Abstract

Existing work is effective only for a given number of GPUs and often neglects the complexity of manually determining the specific types and quantities of GPUs needed, which can be a significant burden for developers. To address this issue, we propose Frenzy, a memory-aware serverless computing method for heterogeneous GPU clusters. Frenzy allows users to submit models without worrying about the underlying hardware resources. First, Frenzy predicts the required number and type of GPUs by estimating the GPU memory usage of the LLM. Then, it employs a low-overhead heterogeneity-aware scheduling method to optimize training efficiency. We validated Frenzy's performance by conducting multi-task LLM training tests on a heterogeneous GPU cluster with three different GPU types. The results show that Frenzy's memory usage prediction accuracy exceeds 92%, that its scheduling overhead is reduced by a factor of 10, and that it reduces the average job completion time by 12% to 18% compared to state-of-the-art methods.
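The memory-based prediction step described in the abstract can be illustrated with a minimal sketch. It is not Frenzy's actual predictor: the names `Plan`, `estimate_mem_per_gpu_gib`, and `enumerate_plans` are hypothetical, the 16 bytes per parameter figure assumes mixed-precision Adam training, model states are assumed to shard evenly across tensor-parallel ranks, and the per-GPU activation footprint is taken as a caller-supplied number rather than modeled.

```python
from dataclasses import dataclass

# Assumption: mixed-precision Adam keeps fp16 weights (2 B), fp16 gradients (2 B)
# and fp32 optimizer states (12 B) per parameter.
BYTES_PER_PARAM = 16
GIB = 2**30


@dataclass(frozen=True)
class Plan:
    """A hypothetical resource plan: parallel degrees plus memory needed per GPU."""
    dp: int
    tp: int
    mem_per_gpu_gib: float

    @property
    def num_gpus(self) -> int:
        return self.dp * self.tp


def estimate_mem_per_gpu_gib(num_params: int, activation_gib_per_gpu: float, tp: int) -> float:
    """Model states (sharded evenly over tp ranks, an assumption) plus activations."""
    model_states_gib = num_params * BYTES_PER_PARAM / tp / GIB
    return model_states_gib + activation_gib_per_gpu


def enumerate_plans(num_params: int, activation_gib_per_gpu: float, max_gpus: int) -> list[Plan]:
    """List power-of-two (dp, tp) combinations that use at most max_gpus GPUs."""
    plans = []
    tp = 1
    while tp <= max_gpus:
        dp = 1
        while dp * tp <= max_gpus:
            mem = estimate_mem_per_gpu_gib(num_params, activation_gib_per_gpu, tp)
            plans.append(Plan(dp=dp, tp=tp, mem_per_gpu_gib=mem))
            dp *= 2
        tp *= 2
    return plans
```

Each resulting plan pairs a GPU count with a per-GPU memory requirement, which is the information needed to decide which GPU types could host the job. How the plans should be ranked is left out here because the abstract does not state the ranking criterion.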

Paper Structure

This paper contains 23 sections, 3 equations, 6 figures, 1 algorithm.

Figures (6)

  • Figure 1: Frenzy overview. A user submits a large-model training job. MARP predicts the required training resources from the model hyper-parameters and training configuration, combined with different data-parallel and tensor-parallel degrees, and outputs multiple prioritized resource plans. HAS retrieves the optimal resource allocation plan among them and then schedules resources on the heterogeneous GPU cluster.
  • Figure 2: MARP. For a training job, MARP calculates the GPU memory that will be occupied by Model States and Activations during training, based on the model and training configuration, under different data-parallel and tensor-parallel degrees. MARP ranks by priority the training resource plans obtained from the different parallel schemes.
  • Figure 3: HAS. Given the prioritized resource allocation plans output by MARP, HAS conducts a sequential search against the current resource status of the cluster to find the optimal plan that can be satisfied. HAS then allocates resources on the heterogeneous GPU cluster to the job according to this plan.
  • Figure 4: Comparison with opportunistic scheduling. QT means queue time, and JCT means job completion time.
  • Figure 5: Frenzy scheduling result compared with Sia
  • ...and 1 more figure
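The sequential search described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the HAS algorithm from the paper: the function name `first_feasible_plan`, the plan and cluster representations, and the GPU type names are all hypothetical. It further assumes that a job is placed on a single GPU type and that, among the types that fit, the one with the least memory is preferred.

```python
from typing import Optional

# A plan is (num_gpus, mem_per_gpu_gib); plans arrive already sorted by priority.
PlanT = tuple[int, float]
# Cluster state maps a GPU type name to (memory per GPU in GiB, free GPU count).
ClusterT = dict[str, tuple[float, int]]


def first_feasible_plan(plans: list[PlanT], free_gpus: ClusterT) -> Optional[tuple[PlanT, str]]:
    """Walk the plans in priority order and return the first one the cluster can satisfy."""
    for num_gpus, mem_needed in plans:
        # Keep only GPU types with enough memory per device and enough free devices.
        candidates = sorted(
            (mem, name)
            for name, (mem, count) in free_gpus.items()
            if mem >= mem_needed and count >= num_gpus
        )
        if candidates:
            # Assumption: prefer the smallest-memory type that fits,
            # leaving larger GPUs free for more demanding jobs.
            _, gpu_type = candidates[0]
            return (num_gpus, mem_needed), gpu_type
    return None  # No plan fits right now; the job would have to wait.


# Hypothetical cluster with three GPU types.
cluster = {
    "type_a_24g": (24.0, 4),
    "type_b_40g": (40.0, 4),
    "type_c_80g": (80.0, 2),
}
prioritized = [(8, 20.0), (4, 30.0), (2, 60.0)]
choice = first_feasible_plan(prioritized, cluster)
```

In this example the highest-priority plan needs 8 GPUs, which no single type can supply, so the search falls through to the second plan and places it on `type_b_40g`. Because the search stops at the first feasible plan, its cost grows only linearly with the number of plans and GPU types.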