Frenzy: A Memory-Aware Serverless LLM Training System for Heterogeneous GPU Clusters
Zihan Chang, Sheng Xiao, Shuibing He, Siling Yang, Zhe Pan, Dong Li
TL;DR
Frenzy tackles the burden of manual GPU selection for LLM training on heterogeneous clusters by introducing MARP, a memory predictor, and HAS, a heterogeneity-aware scheduler in a serverless framework. It demonstrates accurate memory forecasting (92–98%), low-overhead scheduling (≈10x faster), and 12–18% reductions in job completion time across real and simulated heterogeneous clusters. The approach enables seamless deployment and training without user-managed hardware specifics, improving utilization and efficiency. This work broadens serverless applicability to heterogeneous GPU ecosystems and offers a practical path for scalable LLM training.
Abstract
Existing work is effective only for a given number of GPUs and often neglects the complexity of manually determining the specific types and quantities of GPUs needed, which can be a significant burden for developers. To address this issue, we propose Frenzy, a memory-aware serverless computing method for heterogeneous GPU clusters. Frenzy allows users to submit models without worrying about the underlying hardware resources. First, Frenzy predicts the required number and type of GPUs by estimating the GPU memory usage of the LLM. Then, it employs a low-overhead heterogeneity-aware scheduling method to optimize training efficiency. We validated Frenzy's performance by conducting multi-task LLM training tests on a heterogeneous GPU cluster with three different GPU types. The results show that Frenzy's memory usage prediction accuracy exceeds 92%, that its scheduling overhead is reduced by a factor of 10, and that it reduces the average job completion time by 12% to 18% compared to state-of-the-art methods.
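To make the first step concrete, the following is a minimal sketch of how a memory estimate can be turned into a GPU count per GPU type. It is not Frenzy's MARP predictor. It assumes the common rule of thumb that mixed-precision Adam training keeps about 16 bytes of model state per parameter, ignores activation memory, and assumes the state is sharded evenly across GPUs. The GPU catalogue, the `headroom` parameter, and all function names are hypothetical.

```python
import math

# Assumption: mixed-precision Adam holds about 16 bytes per parameter
# (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states).
BYTES_PER_PARAM = 16

# Hypothetical GPU catalogue: type name -> device memory in GiB.
GPU_TYPES = {"gpu-80g": 80, "gpu-40g": 40, "gpu-24g": 24}


def estimate_model_state_gib(num_params):
    """Estimate model-state memory in GiB (activations are ignored)."""
    return num_params * BYTES_PER_PARAM / 2**30


def candidate_plans(num_params, headroom=0.9):
    """Return, per GPU type, the minimum GPU count whose usable memory
    covers the estimate, assuming the state is sharded evenly."""
    need = estimate_model_state_gib(num_params)
    return {
        name: math.ceil(need / (mem_gib * headroom))
        for name, mem_gib in GPU_TYPES.items()
    }


if __name__ == "__main__":
    # A 7-billion-parameter model under the assumptions above.
    print(candidate_plans(7_000_000_000))
```

A scheduler could then choose among these candidate plans according to which GPU types are currently free in the cluster. A realistic predictor would also need to account for activations, batch size, and the parallelism strategy.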
