Topology-aware Preemptive Scheduling for Co-located LLM Workloads

Ping Zhang, Lei Su, Jinjie Yang, Xin Chen

TL;DR

We develop a fine-grained, topology-aware method for preemptive scheduling of hybrid workloads. It ensures that the resources freed by preempted tasks satisfy the topological affinity needs of high-priority preemptors, in either a guaranteed or a best-effort manner.

Abstract

Hosting diverse large language model workloads in a unified resource pool through co-location is cost-effective. For example, long-running chat services generally follow diurnal traffic patterns, which motivates co-locating batch jobs to fill the resource valleys between successive peaks and thus saturate resource allocation across the cluster. These heterogeneous workloads often have different business priorities, so preemption can be leveraged for resource elasticity. However, workloads often have distinct topology preferences as well. The resources released by lower-priority instances may fail to meet the requirements of high-priority online services, which are usually latency-sensitive. The root cause of this mismatch is a lack of topology awareness in the resource scheduler, especially during preemption. To bridge this gap, we develop a fine-grained topology-aware method for preemptive scheduling of hybrid workloads. The method ensures that the resources freed by preempted tasks adhere to the topological affinity needs of high-priority preemptors in a guaranteed or best-effort manner. This dynamic alignment significantly increases the efficiency of preemption and improves overall scheduled performance for LLM workloads by $55\%$.
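
To make the idea of topology-aware victim selection concrete, the following is a minimal illustrative sketch and not the paper's algorithm. It assumes a single server whose GPUs are grouped by NUMA node, and a preemptor that wants all of its GPUs on one NUMA node. The names `Task` and `pick_victims`, the NUMA-level model, and the greedy ordering are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Task:
    # Hypothetical model of a running workload instance.
    name: str
    priority: int  # larger means more important
    numa: int      # NUMA node whose GPUs the task holds
    gpus: int      # number of GPUs held on that NUMA node


def pick_victims(free: Dict[int, int], running: List[Task],
                 need: int, priority: int,
                 guaranteed: bool = True) -> Optional[List[Task]]:
    """Return tasks to preempt so that `need` GPUs become available.

    First try to satisfy the request inside a single NUMA node. If that
    fails and `guaranteed` is False, fall back to freeing GPUs anywhere.
    Returns None when no eligible set of victims is sufficient.
    """
    def cover(candidates: List[Task], available: int) -> Optional[List[Task]]:
        # Greedy: evict lowest-priority, then largest, tasks until covered.
        chosen = []
        for t in sorted(candidates, key=lambda t: (t.priority, -t.gpus)):
            if available >= need:
                break
            chosen.append(t)
            available += t.gpus
        return chosen if available >= need else None

    # Only strictly lower-priority tasks may be preempted.
    eligible = [t for t in running if t.priority < priority]

    best = None
    for numa, idle in free.items():
        plan = cover([t for t in eligible if t.numa == numa], idle)
        if plan is not None and (best is None or len(plan) < len(best)):
            best = plan  # prefer the NUMA node that needs the fewest evictions

    if best is not None or guaranteed:
        return best
    # Best-effort mode: topology cannot be honored, so free GPUs anywhere.
    return cover(eligible, sum(free.values()))
```

In guaranteed mode the function refuses to preempt when no single NUMA node can be freed up enough, which avoids evicting batch work for a placement the preemptor cannot use. In best-effort mode it accepts a placement that spans NUMA nodes instead.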

Paper Structure

This paper contains 24 sections, 3 equations, 11 figures, 5 tables, 2 algorithms.

Figures (11)

  • Figure 1: Distributed LLM serving with workload co-location, from the cluster view. Here, we take an online LLM inference service and an offline LLM inference job as illustrative examples of co-location. Other types of workloads are also considered for similar optimizations.
  • Figure 2: Two hardware topology examples: an NVIDIA 4090 server and an A100 server. The 4090 server is configured as 2Sockets-8NUMAs-64Cores $\times$ 8GPUs, and the A100 server as 2Sockets-2NUMAs-128Cores $\times$ 8GPUs.
  • Figure 3: A snapshot of resource allocation in a cluster of 4090 servers for three co-located workloads
  • Figure 4: Architecture overview of topology-aware preemption
  • Figure 5: Two illustrative examples of FlexTopo and resource allocation states
  • ...and 6 more figures