Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference

Mohammad Siavashi, Faezeh Keshmiri Dindarloo, Dejan Kostic, Marco Chiesa

TL;DR

The paper addresses head-of-line (HOL) blocking in Mixture of Experts (MoE) LLM inference by introducing QLLM, a system that performs fine-grained preemption at the expert level, guided by a priority-aware scheduler. It introduces per-expert queues and unified sequence/batch abstractions, together with a Unified Dynamic Cache that manages KV caches and state, so that latency-sensitive (LS) tasks can preempt best-effort (BE) tasks without losing progress. The approach sharply reduces LS time-to-first-token and turnaround time while preserving throughput, as demonstrated on an Nvidia A100 GPU with Mixtral 8x7B; it also integrates with Hugging Face MoE models. The work highlights modularity and extensibility, and discusses limitations such as memory overhead and potential starvation, with plans for an open-source release.

Abstract

Large Language Models have revolutionized natural language processing, yet serving them efficiently in data centers remains challenging due to mixed workloads comprising latency-sensitive (LS) and best-effort (BE) jobs. Existing inference systems employ iteration-level first-come-first-served scheduling, causing head-of-line blocking when BE jobs delay LS jobs. We introduce QLLM, a novel inference system designed for Mixture of Experts (MoE) models, featuring a fine-grained, priority-aware preemptive scheduler. QLLM enables expert-level preemption, deferring BE job execution while minimizing LS time-to-first-token (TTFT). Our approach removes iteration-level scheduling constraints, enabling the scheduler to preempt jobs at any layer based on priority. Evaluations on an Nvidia A100 GPU show that QLLM significantly improves performance. It reduces LS TTFT by an average of $65.5\times$ and meets the SLO at up to $7$ requests/sec, whereas the baseline fails to do so under the tested workload. Additionally, it cuts LS turnaround time by up to $12.8\times$ without impacting throughput. QLLM is modular, extensible, and seamlessly integrates with Hugging Face MoE models.
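
The idea of expert-level, priority-aware preemption can be illustrated with a toy model. The sketch below is not QLLM's implementation: the names `Job`, `ExpertQueues`, and `simulate` are hypothetical, routing is reduced to a single expert, and a "layer" is just a counter. It only shows how returning control to the scheduler after every layer, rather than after a full iteration, lets an LS job that arrives later overtake a BE job while the BE job keeps its progress.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

LS, BE = 0, 1  # assumed encoding: lower value means higher priority


@dataclass(order=True)
class Job:
    priority: int
    seq: int                                      # arrival order, breaks ties
    name: str = field(compare=False)
    layer: int = field(default=0, compare=False)  # layers completed so far


class ExpertQueues:
    """One priority queue per expert (hypothetical, for illustration only)."""

    def __init__(self, num_experts):
        self.queues = [[] for _ in range(num_experts)]
        self._seq = count()

    def submit(self, expert, name, priority):
        heapq.heappush(self.queues[expert], Job(priority, next(self._seq), name))

    def requeue(self, expert, job):
        # The job keeps its layer counter, so preemption loses no progress.
        heapq.heappush(self.queues[expert], job)

    def step(self, expert):
        """Run one layer of the highest-priority job waiting at this expert."""
        if not self.queues[expert]:
            return None
        job = heapq.heappop(self.queues[expert])
        job.layer += 1
        return job


def simulate(num_layers=3, ls_arrival_step=1):
    """A BE job starts first; an LS job arrives at step `ls_arrival_step`."""
    sched = ExpertQueues(num_experts=1)
    sched.submit(0, "be-0", BE)
    trace, step = [], 0
    while True:
        if step == ls_arrival_step:
            sched.submit(0, "ls-0", LS)
        job = sched.step(0)
        if job is None:
            break
        trace.append((job.name, job.layer))
        if job.layer < num_layers:
            sched.requeue(0, job)  # scheduling decision after every layer
        step += 1
    return trace


if __name__ == "__main__":
    for name, layer in simulate():
        print(f"{name} finished layer {layer}")
```

With the default arguments, the BE job runs layer 1, the LS job then runs layers 1 to 3 without interruption, and the BE job resumes at layer 2. Under iteration-level scheduling the LS job would instead wait for the BE job to finish all of its layers first.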

Paper Structure

This paper contains 14 sections, 5 figures, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of a baseline and QLLM. The baseline employs iteration-level scheduling and continuous batching, with control returning to the scheduler only after all N layers have executed. The figure on the right shows a streamlined execution of QLLM's fine-grained scheduling within layer 1. LS jobs arrive after BE jobs and are batched in step 7.
  • Figure 2: QLLM reduces TTFT for LS jobs by up to $101.6\times$ while ensuring compliance with the SLO. In contrast, Hugging Face fails to meet the SLO even under low load due to priority-oblivious scheduling.
  • Figure 3: QLLM maintains a comparable or slightly higher job completion rate compared to HF.
  • Figure 4: BE Requests
  • Figure 5: LS Requests