ML Inference Scheduling with Predictable Latency

Haidong Zhao, Nikolaos Georgantas

TL;DR

Pursuing high GPU utilization in ML inference scheduling introduces unpredictable interference between co-located tasks, which can cause deadline violations. The authors critique existing interference predictors as coarse-grained and static, and evaluate how co-location dynamics and workload drift degrade their accuracy. They implement a linear-regression predictor on NVIDIA GPUs that uses co-located batch metrics, comparing two treatments of co-location signals (ignoring runtime dynamics entirely versus smoothing them with an exponentially weighted moving average, EWMA) and two training regimes (offline versus online learning with stochastic gradient descent, SGD, or recursive least squares, RLS). They find that static, offline models struggle with changing workloads, whereas online learning improves accuracy, with RLS converging faster. They also outline directions for generalizing the approach to broader cloud and on-premises deployments and to higher concurrency.
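
To make the online-learning idea concrete, the following is a minimal sketch of a linear latency predictor updated with recursive least squares. It is not the authors' implementation: the class name `RLSPredictor`, the feature layout (a bias term, the task's own batch size, and a co-located batch size), the forgetting factor, and the initial covariance scale `delta` are all assumptions chosen for illustration.

```python
class RLSPredictor:
    """Linear model y ~ w . x, updated one sample at a time with RLS.

    Hypothetical sketch: the features and hyperparameters are assumptions,
    not values taken from the paper.
    """

    def __init__(self, n_features, forgetting=0.99, delta=1000.0):
        self.lam = forgetting              # forgetting factor, 0 < lam <= 1
        self.w = [0.0] * n_features        # weight vector
        # P approximates the inverse of the weighted feature covariance.
        self.P = [[delta if i == j else 0.0 for j in range(n_features)]
                  for i in range(n_features)]

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        """Fold one observation (features x, measured latency y) into the model."""
        n = len(x)
        # Px = P @ x (P stays symmetric, so x^T P equals Px transposed).
        Px = [sum(pij * xj for pij, xj in zip(row, x)) for row in self.P]
        denom = self.lam + sum(xi * pxi for xi, pxi in zip(x, Px))
        gain = [pxi / denom for pxi in Px]
        err = y - self.predict(x)          # prediction error before the update
        self.w = [wi + gi * err for wi, gi in zip(self.w, gain)]
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.lam
                   for j in range(n)] for i in range(n)]
        return err


def synthetic_samples(count):
    """Noise-free toy data: latency = 2.0 + 0.5 * own_batch + 0.3 * co_batch."""
    for i in range(count):
        own_batch = 1 + (i * 7) % 16
        co_batch = (i * 5) % 11
        yield [1.0, float(own_batch), float(co_batch)], 2.0 + 0.5 * own_batch + 0.3 * co_batch


model = RLSPredictor(n_features=3)
for features, latency in synthetic_samples(200):
    model.update(features, latency)
```

On this toy data the weights approach the generating coefficients within a few dozen updates. A forgetting factor below 1 discounts older samples, which is one common way to let such a model track drifting workloads.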

Abstract

Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as concurrent tasks contend for GPU resources and thereby introduce interference. Given that interference effects introduce unpredictability in scheduling, neglecting them may compromise SLO or deadline satisfaction. Nevertheless, existing interference prediction approaches remain limited in several respects, which may restrict their usefulness for scheduling. First, they are often coarse-grained, which ignores runtime co-location dynamics and thus restricts their accuracy in interference prediction. Second, they tend to use a static prediction model, which may not effectively cope with different workload characteristics. To this end, we evaluate the potential limitations of existing interference prediction approaches and outline our ongoing work toward achieving efficient ML inference scheduling.

Paper Structure

This paper contains 11 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Requests to the same model can be consolidated and executed as a single execution unit, or batch, to improve GPU utilization.
  • Figure 2: Request throughput across varying batch sizes for each model. This benchmark calculates throughput solely based on the profiled inference execution duration, without accounting for runtime dynamics or data transfer latency between the host and GPU memory. Therefore, the reported values represent a theoretical maximum throughput.
  • Figure 3: 99th percentile latency versus the number of co-located tasks on a GPU, under sequential and concurrent execution. All tasks deploy the same model, ResNet-50 (He et al., 2016), with the inference load evenly distributed among them. Under stress testing, requests may readily aggregate into larger batches and saturate GPU resources; even so, concurrency can still help mitigate head-of-line (HoL) blocking.
  • Figure 4: Illustration of potential batch co-location dynamics in chronological order.
  • Figure 5: Relative interference prediction error when runtime dynamics are fully ignored (Static) or partially considered (EWMA under varying $\alpha$).
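
The EWMA smoothing referred to in Figure 5 can be sketched as follows. This is an illustrative assumption rather than the paper's code: the function names, the convention that $\alpha$ weights the newest observation, and the definition of relative error as |predicted - actual| / actual are all choices made here for the example.

```python
def ewma(signal, alpha):
    """Exponentially weighted moving average of a sequence.

    Assumed convention: alpha weights the newest observation, so
    alpha = 1 reproduces the raw signal and small alpha smooths heavily.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    state = None
    for value in signal:
        # Seed the average with the first observation, then blend.
        state = value if state is None else alpha * value + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed


def relative_error(predicted, actual):
    """Relative prediction error (assumed definition for this sketch)."""
    return abs(predicted - actual) / abs(actual)


# Hypothetical co-located batch sizes observed over four scheduling steps.
co_located_batch = [0, 8, 8, 0]
print(ewma(co_located_batch, alpha=0.5))  # [0, 4.0, 6.0, 3.0]
```

A smaller $\alpha$ reacts more slowly when a co-located batch arrives or leaves, while a larger $\alpha$ follows the raw signal more closely but passes through more of its short-term fluctuation.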