
ENOVA: Autoscaling towards Cost-effective and Stable Serverless LLM Serving

Tao Huang, Pengfei Chen, Kyoka Gong, Jocky Hawk, Zachary Bright, Wenxin Xie, Kecheng Huang, Zhi Ji

TL;DR

This work tackles the challenge of stable and cost-effective serverless LLM serving on multi-GPU clusters by introducing ENOVA, a system that jointly optimizes service configurations and real-time autoscaling. It decomposes LLM inference into actionable configurations (e.g., max_num_seqs, gpu_memory, max_tokens) and pairs this with a semi-supervised performance detector based on a variational auto-encoder to trigger reconfigurations. The authors implement an end-to-end deployment engine, load-balancing, and multi-cluster scheduling, and provide extensive experiments showing ENOVA outperforms baselines in throughput and resilience, while preserving accuracy. The work demonstrates practical impact for large online systems by enabling stable, scalable LLM serving with reduced developer overhead and accessible code at the project repository.

Abstract

With the growing popularity of large language model (LLM) backend systems, it has become common and necessary to deploy stable serverless LLM serving on multi-GPU clusters with autoscaling. However, this is challenging because the diversity and co-location of applications in multi-GPU clusters lead to low service quality and low GPU utilization. To address these challenges, we build ENOVA, a deployment, monitoring, and autoscaling service for serverless LLM serving. ENOVA comprehensively deconstructs the execution process of an LLM service and, on that basis, designs a configuration recommendation module for automatic deployment on any GPU cluster and a performance detection module for autoscaling. On top of these modules, ENOVA implements a deployment execution engine for multi-GPU cluster scheduling. Experimental results show that ENOVA significantly outperforms other state-of-the-art methods and is suitable for wide deployment in large online systems.
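
To make the interplay between service configuration and detection-triggered autoscaling more concrete, the following is a minimal Python sketch, not ENOVA's implementation. The field names `max_num_seqs`, `gpu_memory`, and `max_tokens` come from the TL;DR; everything else (the `ServiceConfig` class, the `replicas` field, the unit of `gpu_memory`, and the functions `is_anomalous` and `maybe_scale_up`) is hypothetical. The paper's detector is a semi-supervised variational auto-encoder; here it is replaced by a plain z-score test on the pending-request count, purely for illustration.

```python
from dataclasses import dataclass, replace
from statistics import mean, stdev


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical container for the configuration knobs named in the TL;DR."""
    max_num_seqs: int   # maximum number of sequences processed concurrently
    gpu_memory: float   # GPU memory budget (unit is an assumption; the source does not state it)
    max_tokens: int     # maximum number of generated tokens per request
    replicas: int = 1   # assumed field: number of service replicas


def is_anomalous(baseline, current, threshold=3.0):
    """Stand-in detector: flag `current` if it lies more than `threshold`
    sample standard deviations above the baseline mean.
    (The paper uses a VAE-based detector; this z-score is a simplification.)"""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Flat baseline: any increase counts as anomalous.
        return current > mu
    return (current - mu) / sigma > threshold


def maybe_scale_up(config, baseline_pending, current_pending, threshold=3.0):
    """Return a config with one more replica if pending requests look anomalous,
    otherwise return the config unchanged."""
    if is_anomalous(baseline_pending, current_pending, threshold):
        return replace(config, replicas=config.replicas + 1)
    return config


if __name__ == "__main__":
    cfg = ServiceConfig(max_num_seqs=32, gpu_memory=0.9, max_tokens=256)
    baseline = [0, 1, 0, 2, 1, 0, 1, 1]   # pending requests under normal load
    print(maybe_scale_up(cfg, baseline, 40).replicas)  # backlog spike
    print(maybe_scale_up(cfg, baseline, 1).replicas)   # normal load
```

Pending requests are used as the monitored signal because Figure 1 reports running and pending requests; a real deployment would likely feed several metrics into a learned detector rather than a single threshold.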

Paper Structure (33 sections, 8 equations, 8 figures, 4 tables)


Figures (8)

  • Figure 1: The monitoring metrics for running and pending requests when the number of requests per second sent to the LLM service is set to $7$ and $6$, respectively.
  • Figure 2: How deconstructing the LLM inference process leads to the design of the configuration module and the detection module in ENOVA.
  • Figure 3: The implementation components of ENOVA, designed to ensure accurate execution of the deployment, monitoring, and autoscaling services.
  • Figure 4: The throughput and latency performance comparison between ENOVA and baselines on five LLMs.
  • Figure 5: The accuracy and pass@1 of five LLMs on the gsm8k and mbpp datasets, respectively.
  • ...and 3 more figures