ENOVA: Autoscaling towards Cost-effective and Stable Serverless LLM Serving
Tao Huang, Pengfei Chen, Kyoka Gong, Jocky Hawk, Zachary Bright, Wenxin Xie, Kecheng Huang, Zhi Ji
TL;DR
This work tackles the challenge of stable and cost-effective serverless LLM serving on multi-GPU clusters by introducing ENOVA, a system that jointly optimizes service configurations and real-time autoscaling. It decomposes LLM inference into actionable configurations (e.g., max_num_seqs, gpu_memory, max_tokens) and pairs them with a semi-supervised performance detector, based on a variational auto-encoder, that triggers reconfiguration. The authors implement an end-to-end deployment engine with load balancing and multi-cluster scheduling, and report extensive experiments in which ENOVA outperforms baselines in throughput and resilience while preserving accuracy. The work demonstrates practical impact for large online systems by enabling stable, scalable LLM serving with reduced developer overhead; the code is available at the project repository.
Abstract
With the increasing popularity of large language model (LLM) backend systems, it has become common and necessary to deploy stable serverless LLM serving on multi-GPU clusters with autoscaling. However, this is challenging because the diversity and co-location of applications in multi-GPU clusters lead to low service quality and low GPU utilization. To address these challenges, we build ENOVA, a deployment, monitoring, and autoscaling service for serverless LLM serving. ENOVA comprehensively deconstructs the execution process of an LLM service and, on that basis, designs a configuration recommendation module for automatic deployment on any GPU cluster and a performance detection module for autoscaling. On top of these modules, ENOVA implements a deployment execution engine for multi-GPU cluster scheduling. Experimental results show that ENOVA significantly outperforms other state-of-the-art methods and is suitable for wide deployment in large online systems.
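
To make the interplay between service configuration and detection-driven autoscaling concrete, the following is a minimal Python sketch, not ENOVA's implementation. The configuration fields max_num_seqs, gpu_memory, and max_tokens are taken from the TL;DR; the replicas field, the function names, the latency metric, and the threshold are hypothetical. The detector here is a plain z-score check standing in for the paper's semi-supervised variational auto-encoder, which is not reproduced.

```python
from dataclasses import dataclass, replace
from statistics import mean, pstdev


@dataclass(frozen=True)
class ServiceConfig:
    # Field names follow the examples in the TL;DR; values are illustrative.
    max_num_seqs: int
    gpu_memory: float  # assumed here to be a fraction of GPU memory
    max_tokens: int
    replicas: int = 1  # hypothetical field, added only for this sketch


def anomaly_score(baseline, value):
    """Signed z-score of `value` against a baseline window.

    A simple stand-in for a learned detector; only values above the
    baseline mean produce a positive score.
    """
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        return float("inf") if value > mu else 0.0
    return (value - mu) / sigma


def maybe_scale(config, baseline_latency, latest_latency, threshold=3.0):
    """Return a config with one more replica if the latest latency is anomalous."""
    if anomaly_score(baseline_latency, latest_latency) > threshold:
        return replace(config, replicas=config.replicas + 1)
    return config


baseline_ms = [100, 102, 98, 101, 99]
cfg = ServiceConfig(max_num_seqs=32, gpu_memory=0.9, max_tokens=512)
scaled = maybe_scale(cfg, baseline_ms, latest_latency=150)
steady = maybe_scale(cfg, baseline_ms, latest_latency=101)
```

In this sketch a latency spike adds a replica while a reading within the normal range leaves the configuration unchanged; a real system would also have to decide when to scale in and which configuration fields to retune.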
