ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments

Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, Eiko Yoneki

TL;DR

ThunderServe tackles the challenge of serving large language models in cloud environments with heterogeneous GPUs by formulating a two-level scheduling problem that jointly optimizes GPU grouping, phase designation, parallel configuration, and inter-phase orchestration. It introduces a tabu-search-based upper-level solver and a lower-level optimization that leverages a two-stage transportation problem plus dynamic programming for pipeline routing, complemented by a lightweight real-time rescheduling mechanism that adjusts only phase designation and orchestration to adapt to workload shifts without reloading model parameters. The system leverages phase splitting, non-uniform batching considerations, and KV-cache compression to minimize inter-phase communication costs while sustaining model quality, achieving up to 2.1x throughput improvements and up to 2.5x latency reduction at the same price budget compared to baselines. Experimental results in heterogeneous cloud and homogeneous in-house settings demonstrate ThunderServe’s cost-efficiency and scalability, highlighting strong performance gains in SLO attainment and throughput, and validating the practicality of disaggregated cloud deployments for large-scale LLM serving.
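To make the tabu-search idea concrete, the following is a minimal sketch of phase designation only, not ThunderServe's actual solver. It assumes four GPUs with made-up per-phase service rates (`PREFILL_RATE`, `DECODE_RATE`) and a toy objective in which throughput is limited by the slower of the two phases. The names `tabu_search`, `flip_one`, and `throughput` are hypothetical and chosen for illustration. The system described above additionally searches over GPU grouping, parallel configuration, and orchestration.

```python
def tabu_search(initial, neighbors, score, iters=50, tabu_size=5):
    """Generic tabu search that maximizes `score` (illustrative only)."""
    current = best = initial
    tabu = [initial]  # recently visited plans that may not be revisited
    for _ in range(iters):
        candidates = [n for n in neighbors(current) if n not in tabu]
        if not candidates:
            break
        # Move to the best non-tabu neighbor, even if it is worse than
        # the current plan; this is what lets the search leave local optima.
        current = max(candidates, key=score)
        tabu.append(current)
        if len(tabu) > tabu_size:
            tabu.pop(0)  # forget the oldest entry
        if score(current) > score(best):
            best = current
    return best


# Assumed per-GPU rates in requests/s; these numbers are invented.
PREFILL_RATE = (4.0, 4.0, 1.0, 1.0)
DECODE_RATE = (2.0, 2.0, 3.0, 3.0)


def throughput(plan):
    """Toy objective: the slower phase bounds end-to-end throughput."""
    prefill = sum(PREFILL_RATE[i] for i, p in enumerate(plan) if p == "P")
    decode = sum(DECODE_RATE[i] for i, p in enumerate(plan) if p == "D")
    return min(prefill, decode)


def flip_one(plan):
    """Neighbors: plans that differ in exactly one GPU's phase."""
    for i, p in enumerate(plan):
        yield plan[:i] + ("D" if p == "P" else "P",) + plan[i + 1:]


# Start with every GPU designated as prefill ("P").
best = tabu_search(("P", "P", "P", "P"), flip_one, throughput)
print(best, throughput(best))
```

In this toy setting the search assigns the two GPUs with higher prefill rates to prefill and the other two to decode. Because only the phase labels change, a move of this kind corresponds to the sort of adjustment the lightweight rescheduling mechanism makes without reloading model parameters.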

Abstract

Recent developments in large language models (LLMs) have demonstrated their remarkable proficiency in a range of tasks. Compared to in-house homogeneous GPU clusters, deploying LLMs in cloud environments with diverse types of GPUs is crucial for addressing the GPU shortage problem and being more cost-effective. However, the diversity of network environments and GPU types in the cloud makes high-performance serving difficult to achieve. In this work, we propose ThunderServe, a high-performance and cost-efficient LLM serving system for heterogeneous cloud environments. We introduce a novel scheduling algorithm, which optimizes the deployment plan of LLM serving to accommodate the heterogeneous resource and network bandwidth conditions in cloud environments. Furthermore, we propose a lightweight re-scheduling mechanism, designed to adapt to fluctuating online conditions (e.g., node failures, workload shifts) without the need for costly restarts of ongoing services. Empirical results in both heterogeneous cloud and homogeneous in-house environments reveal that ThunderServe delivers up to a $2.1\times$ and on average a $1.7\times$ increase in throughput and achieves up to a $2.5\times$ and on average a $1.5\times$ reduction in latency deadlines compared with state-of-the-art systems given the same price budget, suggesting that opting for cloud services provides a more cost-efficient solution.

Paper Structure

This paper contains 24 sections, 2 equations, 19 figures, 8 tables, 2 algorithms.

Figures (19)

  • Figure 1: Prefill and decode prices for a single request with input and output lengths of 512 and 16 on 3090Ti and A40.
  • Figure 2: Effects of batching on different phases (LLaMA-7B with each input having a sequence length of 1024).
  • Figure 3: Workflow of our scheduling algorithm.
  • Figure 4: Examples of neighbor construction in tabu search (changes in phase designation are omitted for simplicity).
  • Figure 5: An example orchestration of prefill and decode replicas.
  • ...and 14 more figures