Hierarchical Autoscaling for Large Language Model Serving with Chiron

Archit Patke, Dhemath Reddy, Saurabh Jha, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer

TL;DR

This work addresses the challenge of meeting latency-oriented SLOs for large language model serving, distinguishing interactive workloads (tight SLOs) from batch workloads (relaxed SLOs). It introduces Chiron, a hierarchical autoscaler that applies local backpressure to adapt per-instance batch sizes and global backpressure to adjust cluster size, with request groups enabling multiplexing between traffic types. The approach yields up to 90% better SLO attainment, up to 300% higher throughput, and up to 70% GPU savings versus prior autoscalers, in both interactive and batch scenarios. The results underscore the value of combining per-instance tuning with cluster-wide provisioning and request grouping to improve utilization and responsiveness in real-world LLM deployments.

Abstract

Large language model (LLM) serving is becoming an increasingly important workload for cloud providers. Based on performance SLO requirements, LLM inference requests can be divided into (a) interactive requests that have tight SLOs on the order of seconds, and (b) batch requests that have relaxed SLOs on the order of minutes to hours. SLO attainment can degrade depending on arrival rates, multiplexing, and configuration parameters, which necessitates autoscaling both the serving instances and their batch sizes. However, previous autoscalers for LLM serving do not consider request SLOs, leading to unnecessary scaling and resource under-utilization. To address these limitations, we introduce Chiron, an autoscaler that uses the idea of hierarchical backpressure estimated using queue size, utilization, and SLOs. Our experiments show that Chiron achieves up to 90% higher SLO attainment and improves GPU efficiency by up to 70% compared to existing solutions.
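
To make the two levels of backpressure concrete, the following is a minimal Python sketch and not Chiron's actual algorithm. It assumes a local signal defined as observed inter-token latency divided by its SLO, and a global signal defined as the estimated queue drain time divided by the request SLO. All names, thresholds, and step sizes (`InstanceStats`, `next_batch_size`, `next_instance_count`, `low`, `step`) are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class InstanceStats:
    batch_size: int             # current maximum batch size on this instance
    inter_token_latency: float  # observed seconds per generated token
    itl_slo: float              # inter-token latency target in seconds


def local_backpressure(stats: InstanceStats) -> float:
    """Assumed local signal: observed latency relative to its SLO."""
    return stats.inter_token_latency / stats.itl_slo


def next_batch_size(stats: InstanceStats, step: int = 4, low: float = 0.8,
                    min_batch: int = 1, max_batch: int = 256) -> int:
    """Halve the batch size when the SLO is violated, grow it when there is headroom."""
    bp = local_backpressure(stats)
    if bp > 1.0:
        return max(min_batch, stats.batch_size // 2)
    if bp < low:
        return min(max_batch, stats.batch_size + step)
    return stats.batch_size


def global_backpressure(queued_requests: int, per_instance_throughput: float,
                        instances: int, slo_seconds: float) -> float:
    """Assumed global signal: estimated queue drain time relative to the SLO."""
    drain_time = queued_requests / (per_instance_throughput * instances)
    return drain_time / slo_seconds


def next_instance_count(queued_requests: int, per_instance_throughput: float,
                        slo_seconds: float, min_instances: int = 1,
                        max_instances: int = 64) -> int:
    """Smallest instance count that drains the queue within the SLO, clamped to limits."""
    desired = math.ceil(queued_requests / (per_instance_throughput * slo_seconds))
    return min(max_instances, max(min_instances, desired))
```

In this sketch, an instance whose latency exceeds its SLO shrinks its batch, while the cluster grows only when the queue cannot be drained within the SLO. A relaxed batch SLO (a large `slo_seconds`) therefore tolerates a long queue without adding instances, which is the intuition behind SLO-aware scaling described in the abstract.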

Paper Structure (22 sections, 2 equations, 19 figures, 2 algorithms)

Figures (19)

  • Figure 1: Illustration comparing Chiron with previous systems. Chiron uses fewer instances (three for Chiron vs. five for previous systems) because of (a) global autoscaling based on queuing and request multiplexing, and (b) local autoscaling based on dynamic batch sizes.
  • Figure 2: Previously proposed LLM serving systems overestimate backpressure, leading to cluster-wide underutilization. (Left) Cluster-wide utilization when serving a mix of batch and interactive requests for Llama 8B and Llama 70B. (Right) GPUs required to serve the workload across various autoscalers. "Local" and "Global" are Chiron's autoscalers when used independently.
  • Figure 3: Variation in inter-token latency and token throughput with increasing batch size.
  • Figure 4: Request arrival spikes in a production serving cluster over a 5-hour period.
  • Figure 5: Over-provisioning required for varying burstiness.
  • ...and 14 more figures

Theorems & Definitions (3)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3