Adaptive Orchestration for Large-Scale Inference on Heterogeneous Accelerator Systems Balancing Cost, Performance, and Resilience

Yahav Biran, Imry Kissos

TL;DR

This work addresses how to scale latency-sensitive inference across heterogeneous accelerators while controlling costs. It introduces a hardware-agnostic control loop with two operating modes (cost-optimized and capacity-optimized) driven by real-time cost and capacity signals, and formalizes the approach through an optimization framework and a capacity-dynamics state machine. The methodology describes a cloud-native, layered architecture (Data Plane, Model Execution Layer, Resource Orchestration) with containerization, dynamic scaling, and hardware-agnostic policies, implemented on Kubernetes/EKS and AWS Neuron/GPU stacks. Experimental results on Stable Diffusion show that latency targets are met consistently, traffic fails over effectively during capacity shortfalls, and requests are distributed in a cost-aware manner. Overall, the system enables efficient, resilient inference at scale by coordinating cross-hardware execution through a feedback-driven deployment strategy spanning software and hardware layers.

Abstract

The surge in generative AI workloads has created a need for scalable inference systems that can flexibly harness both GPUs and specialized accelerators while containing operational costs. This paper proposes a hardware-agnostic control loop that adaptively allocates requests across heterogeneous accelerators based on real-time cost and capacity signals. The approach sustains low latency and high throughput by dynamically shifting between cost-optimized and capacity-optimized modes, ensuring the most efficient use of expensive compute resources under fluctuating availability. Evaluated using the Stable Diffusion model, the framework consistently meets latency targets, automatically redirects traffic during capacity shortfalls, and capitalizes on lower-cost accelerators when possible. These results highlight how a feedback-driven deployment strategy, spanning the entire software and hardware stack, can help organizations efficiently scale generative AI workloads while maintaining resilience in the face of limited accelerator capacity.
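The mode switch described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the `Pool` fields, the switching rule (stay cost-optimized while the cheapest pool alone can absorb demand), and the capacity-proportional weights are all assumptions made for illustration, and the cost figures in the usage note are invented.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str                  # hypothetical pool label, e.g. an accelerator type
    cost_per_request: float    # assumed cost signal (arbitrary units)
    available_capacity: float  # assumed capacity signal (requests/s servable now)


def choose_mode(pools, demand_rps):
    """Assumed rule: cost-optimized while the cheapest pool can absorb all demand."""
    cheapest = min(pools, key=lambda p: p.cost_per_request)
    if cheapest.available_capacity >= demand_rps:
        return "cost-optimized"
    return "capacity-optimized"


def route_weights(pools, demand_rps):
    """Return (mode, {pool name: traffic share}) for the current signals."""
    mode = choose_mode(pools, demand_rps)
    if mode == "cost-optimized":
        # Send all traffic to the cheapest pool.
        cheapest = min(pools, key=lambda p: p.cost_per_request)
        return mode, {p.name: (1.0 if p is cheapest else 0.0) for p in pools}
    # Capacity-optimized: spread traffic in proportion to available capacity.
    total = sum(p.available_capacity for p in pools)
    if total == 0:
        return mode, {p.name: 1.0 / len(pools) for p in pools}
    return mode, {p.name: p.available_capacity / total for p in pools}
```

For example, with `Pool("inf2", 1.0, 100.0)` and `Pool("a10g", 2.0, 50.0)`, a demand of 80 requests/s stays in cost-optimized mode and routes everything to `inf2`, while a demand of 120 requests/s switches to capacity-optimized mode and splits traffic roughly two to one.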

Paper Structure

This paper contains 26 sections, 13 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The diagram illustrates a load balancer distributing traffic to five parallel model applications labeled "Model app SD21". Each application consists of compute framework and hardware accelerator components. The first two apps use CUDA (eager-mode) and Triton (graph-mode) with A10G accelerators. The third uses Triton with L4. The fourth and fifth run on Triton and Neuron frameworks, powered by Trn1 and Inf2 accelerators, respectively. Karpenter provisions resources through NodePools and NodeClasses managing NVIDIA (A10G, L4) and Neuron (Trn1, Inf2) instances. Arrows show the hierarchical relationships.
  • Figure 2: Model execution layer showing AI model, PyTorch, and supporting components.
  • Figure 3: Model quality baseline under non-determinism
  • Figure 4: Load tests that locate the breaking point of models loaded on Neuron and NVIDIA accelerators: the point at which latency exceeds the set thresholds or compute usage rises above 80%. The application is load tested for each compute accelerator and framework combination, such as Inf2, Trn1, or GPU with CUDA, NeuronX, or Triton. The results define the $N^{modelProcessed}_i(t)$ that the autoscaler, KEDA, uses to scale the required number of $DU^p_i$ for each deployment combination. The breaking point occurs when throughput plateaus and latency exceeds 900 milliseconds. Shown are the load tests conducted on A10G and L4 NVIDIA cores and on Inf2 and Trn1 Neuron cores.
  • Figure 5: Two pairs of graphs. In the first pair, the top graph shows the throughput (requests per second) of each deployment unit over time, peaking around mid-experiment, with sd21-inf2-counter reaching the highest throughput; the bottom graph shows inference latency per deployment, where the cost-optimized deployments maintain consistently low latency while the others vary slightly more. In the second pair, the top graph shows total inference throughput, where successful requests (2XX) increase steadily and then plateau with minimal error responses (5XX); the bottom graph shows GPU/Neuron utilization, where GPU utilization is consistent while Neuron utilization fluctuates significantly throughout the experiment.
  • ...and 2 more figures
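
The load-test procedure in the Figure 4 caption can be sketched as follows, under stated assumptions. The 900 ms latency and 80% utilization thresholds come from the caption; the sample format, the function names, and the ceiling-division rule for the number of deployment units are hypothetical and are not taken from the paper or from KEDA.

```python
import math

LATENCY_THRESHOLD_MS = 900.0   # from the Figure 4 caption
UTILIZATION_THRESHOLD = 0.80   # from the Figure 4 caption


def breaking_point(samples):
    """Return the highest throughput seen before a threshold is crossed.

    samples: (throughput_rps, latency_ms, utilization) tuples ordered by
    increasing load (assumed format). Returns None if the first sample
    already violates a threshold.
    """
    best = None
    for rps, latency_ms, utilization in samples:
        if latency_ms > LATENCY_THRESHOLD_MS or utilization > UTILIZATION_THRESHOLD:
            break
        best = rps if best is None else max(best, rps)
    return best


def units_needed(demand_rps, per_unit_rps):
    """Assumed sizing rule: enough units to cover demand, never fewer than one."""
    if per_unit_rps <= 0:
        raise ValueError("per_unit_rps must be positive")
    return max(1, math.ceil(demand_rps / per_unit_rps))
```

A per-unit throughput obtained this way would play the role of $N^{modelProcessed}_i(t)$ for one accelerator and framework combination, and `units_needed` stands in for the scaling decision that the caption attributes to KEDA.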