BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization

Youhe Jiang, Fangcheng Fu, Eiko Yoneki

TL;DR

BOute addresses cost-efficient LLM serving by jointly optimizing query routing and model deployment across heterogeneous LLMs and GPUs. It casts the problem as a two-objective optimization that balances latency $L$ against quality $Q$ under resource and budget constraints, and solves it with a multi-objective Bayesian optimization (MOBO) framework split into an offline preparation phase and an online optimization phase. The approach introduces a load-fraction encoding, a model-GPU preference encoding, and a constrained qNEHVI acquisition function over GP surrogates with additive kernels. The resulting Pareto-optimal configurations outperform baselines by up to 2.6× in latency, or reduce costs by 38–40% while preserving the target quality. The results indicate that exploiting heterogeneity in both models and hardware, combined with principled MOBO scheduling, yields scalable, cost-efficient LLM serving.
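As a rough sketch of the two-objective problem described above, it can be written as follows. Only $L$ and $Q$ come from the summary; the routing strategy $r$, the deployment $d$, the cost function $C$, the budget $B$, and the per-type GPU counts $n_g$ and $N_g$ are notation assumed here for illustration and may not match the paper's own symbols.

$$
\min_{r,\,d}\; L(r, d), \qquad \max_{r,\,d}\; Q(r, d) \qquad \text{s.t.} \quad C(d) \le B, \quad n_g(d) \le N_g \;\; \text{for every GPU type } g
$$

Because the two objectives generally conflict, the solution is a Pareto set of $(r, d)$ pairs rather than a single optimum.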

Abstract

The rapid growth of large language model (LLM) deployments has made cost-efficient serving systems essential. Recent efforts to enhance system cost-efficiency adopt two main perspectives: (i) an algorithmic perspective that exploits heterogeneous model capabilities to route simpler queries to lower-cost models and complex queries to higher-cost models (i.e., heterogeneous query routing); and (ii) a systems perspective that utilizes heterogeneous GPU resources as cost-effective alternatives to homogeneous high-end GPUs (i.e., heterogeneous model deployment). However, algorithm-system co-design for cost-efficient LLM serving necessitates sophisticated management: (i) determining optimal query routing strategies under latency and quality requirements, (ii) configuring model deployment across heterogeneous GPUs with appropriate resource allocation and parallelism strategies, and (iii) co-optimizing routing and deployment decisions to maximize overall system performance. To address these challenges, we present BOute, a quality-aware scheduling system that jointly exploits heterogeneous model and GPU capabilities for cost-efficient LLM serving. BOute employs a multi-objective Bayesian optimization (MOBO) framework to co-optimize the routing strategy and model deployment, thereby maximizing the cost-efficiency of the serving system while guaranteeing response quality. Evaluation results demonstrate that BOute outperforms state-of-the-art LLM serving systems by up to 157% (59% on average) under identical cost budgets and quality requirements, or reduces serving costs by 15%-61% (38% on average) while maintaining the same performance targets, validating its effectiveness in achieving cost-efficient LLM serving.

Paper Structure (21 sections, 5 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Upper: Illustration of system load assignment and resource allocation for approaches 1-3. Lower Left: Performance comparison among the baseline and 3 approaches. Lower Right: Model performance across GPU types. The latency is normalized by each model's P95 latency on H100 GPUs.
  • Figure 2: Illustration of the two phases in the MOBO framework.
  • Figure 3: Illustration of single- and multi-replica deployments.
  • Figure 4: Experimental results of BOute compared with different baselines on GSM8K and MTBench workloads under different quality requirements. We select the minimum P95 latency routing strategy and model deployment from the Pareto-optimal solution set. GSM8K-87 represents a quality requirement of 87 (in terms of aggregated accuracy), MTBench-8.1 represents a quality requirement of 8.1 (in terms of aggregated score), and so on.
  • Figure 5: P95 latency results across different quality requirements.
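The selection rule stated in the Figure 4 caption (take the minimum-P95-latency configuration from the Pareto-optimal set, subject to a quality requirement) can be sketched in a few lines of Python. This is a minimal illustration, not BOute's implementation: the `Config` type, the function names, and all numbers in `candidates` are hypothetical and are not results from the paper.

```python
from typing import NamedTuple, Optional, Sequence


class Config(NamedTuple):
    """A candidate routing + deployment configuration (hypothetical type)."""
    name: str
    p95_latency: float  # lower is better
    quality: float      # higher is better


def pareto_front(configs: Sequence[Config]) -> list[Config]:
    """Keep configurations that no other configuration dominates."""
    front = []
    for c in configs:
        dominated = any(
            o.p95_latency <= c.p95_latency
            and o.quality >= c.quality
            and (o.p95_latency < c.p95_latency or o.quality > c.quality)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front


def select_min_latency(
    configs: Sequence[Config], quality_requirement: float
) -> Optional[Config]:
    """Pick the lowest-P95-latency Pareto point meeting the quality requirement."""
    feasible = [c for c in pareto_front(configs) if c.quality >= quality_requirement]
    if not feasible:
        return None  # no configuration satisfies the requirement
    return min(feasible, key=lambda c: c.p95_latency)


# Made-up numbers purely for illustration (not measurements from the paper).
candidates = [
    Config("A", p95_latency=1.0, quality=80.0),
    Config("B", p95_latency=2.0, quality=87.0),
    Config("C", p95_latency=3.0, quality=90.0),
    Config("D", p95_latency=2.5, quality=86.0),  # dominated by B
]

chosen = select_min_latency(candidates, quality_requirement=87.0)
print(chosen)
```

With a requirement of 87 (in the style of the "GSM8K-87" setting), the sketch discards the dominated point D and the low-quality point A, then returns the faster of the remaining two.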