Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica

TL;DR

The paper tackles the high cost of deploying LLMs by analyzing how the choice of GPU type interacts with workload characteristics. It shows that request size, request rate, and SLO jointly determine cost efficiency, so a heterogeneous mix of GPUs often yields the lowest cost. Mélange formulates GPU allocation as a cost-aware bin packing problem and solves it as an integer linear program (ILP), using offline GPU profiling to derive the maximum throughput (MaxTput) of each GPU type for each workload bucket. Experiments across multiple GPU types and datasets show cost reductions of up to 77% relative to single-GPU-type deployments while still meeting SLOs.

Abstract

Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances. Prior work has addressed the high cost of LLM serving by improving the inference engine, but less attention has been given to selecting the most cost-efficient GPU type(s) for a specific LLM service. There is a large and growing landscape of GPU types and, within these options, higher cost does not always lead to increased performance. Instead, through a comprehensive investigation, we find that three key LLM service characteristics (request size, request rate, SLO) strongly influence GPU cost efficiency, and differing GPU types are most cost efficient for differing LLM service settings. As a result, the most cost-efficient allocation for a given service is typically a mix of heterogeneous GPU types. Based on this analysis, we introduce Mélange, a GPU allocation framework that navigates these diverse LLM service characteristics and heterogeneous GPU option space to automatically and efficiently derive the minimal-cost GPU allocation for a given LLM service. We formulate the GPU allocation task as a cost-aware bin packing problem where GPUs are bins and items are slices of the service workload. Our formulation's constraints account for a service's unique characteristics, allowing Mélange to be flexible to support diverse service settings and heterogeneity-aware to adapt the GPU allocation to a specific service. Compared to using only a single GPU type, Mélange reduces deployment costs by up to 77% in conversational settings, 33% in document-based settings, and 51% in a mixed setting.
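The bin packing view described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it swaps the ILP solver for a brute-force search that is only practical for a handful of workload slices, and it assumes each slice is assigned to exactly one GPU type and that a GPU type's total load is covered by rounding up to a whole number of GPUs. The function name `min_cost_allocation`, the hourly prices, and the per-slice load fractions are all hypothetical values chosen for illustration.

```python
import math
from itertools import product


def min_cost_allocation(gpu_cost, slice_load):
    """Brute-force stand-in for a cost-aware bin packing solver.

    gpu_cost:   dict mapping GPU type -> hourly price (hypothetical numbers).
    slice_load: list of dicts; slice_load[i][g] is the fraction of one GPU of
                type g that workload slice i consumes (request rate divided by
                that GPU's profiled maximum throughput for the slice).
                Use math.inf when a GPU type cannot meet the SLO for a slice.
    Returns (total_cost, gpu_counts, assignment), or None if nothing is feasible.
    """
    gpus = list(gpu_cost)
    best = None
    # Try every way of assigning each slice to one GPU type.
    for assignment in product(gpus, repeat=len(slice_load)):
        total = {g: 0.0 for g in gpus}
        for load, g in zip(slice_load, assignment):
            total[g] += load[g]
        if any(math.isinf(v) for v in total.values()):
            continue  # some slice landed on a GPU type that cannot serve it
        # Bins are whole GPUs: round each type's aggregate load up.
        counts = {g: max(0, math.ceil(total[g] - 1e-9)) for g in gpus}
        cost = sum(counts[g] * gpu_cost[g] for g in gpus)
        if best is None or cost < best[0]:
            best = (cost, counts, list(assignment))
    return best


# Hypothetical prices and loads: two GPU types, two workload slices.
gpu_cost = {"A10G": 1.0, "A100": 3.0}
slice_load = [
    {"A10G": 0.5, "A100": 0.3},  # small requests
    {"A10G": 4.5, "A100": 0.9},  # large requests
]

cost, counts, assignment = min_cost_allocation(gpu_cost, slice_load)
print(cost, counts, assignment)
```

With these made-up numbers, serving everything on A10G needs five GPUs (cost 5.0) and serving everything on A100 needs two (cost 6.0), while placing small requests on A10G and large requests on A100 needs one of each (cost 4.0). The search therefore returns the mixed allocation, which mirrors the abstract's point that a heterogeneous allocation can be cheaper than any single GPU type.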

Paper Structure (29 sections, 5 equations, 12 figures, 8 tables)

Figures (12)

  • Figure 1: Mélange framework.
  • Figure 2: Request latency of different input/output lengths on A100-80G.
  • Figure 3: (a) depicts A10G and A100's relative T/$ across request sizes. (b) expands (a) into separate input and output length dimensions. Tile colors indicate which GPU achieves higher T/$, and values represent the percent increase of T/$ relative to the less cost-efficient GPU.
  • Figure 4: (a) depicts the absolute batch sizes of A10G and A100 serving Llama2-7b at maximum saturation; (b) reports the same batch sizes divided by GPU cost, plotted relative to A10G.
  • Figure 5: Comparison of L4, A10G, A100, and H100. Tile colors indicate the GPU with the greatest T/$. In (a), tile values are the T/$ percent increase of the best GPU compared to the second best for that tile. (b) compares the best GPU to the worst GPU. In black boxes, only A100 and H100 are compared.
  • ...and 7 more figures