
Preble: Efficient Distributed Prompt Scheduling for LLM Serving

Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang

TL;DR

This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes for prompt sharing. Its distributed scheduling system co-optimizes KV state reuse and computation load-balancing using a new scheduling algorithm and a hierarchical scheduling mechanism.

Abstract

Prompts to large language models (LLMs) have evolved beyond simple user questions. For LLMs to solve complex problems, today's practice is to include domain-specific instructions, illustrations of tool usage, and/or long context such as textbook chapters in prompts. As a result, many parts of prompts are repeated across requests. Recent works propose to cache and reuse the KV state of prompts. However, they are all confined to single-GPU optimization, while production LLM serving systems are distributed by nature. This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes for prompt sharing. We designed a distributed scheduling system that co-optimizes KV state reuse and computation load-balancing with a new scheduling algorithm and a hierarchical scheduling mechanism. Our evaluation of Preble with real workloads and request arrival patterns on two open-source LLMs shows that Preble outperforms the SOTA serving systems by 1.5X to 14.5X on average latency and 2X to 10X on p99 latency.
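
To make the tension between KV state reuse and load balancing concrete, here is a minimal toy sketch in Python. It is not Preble's scheduling algorithm, which this page does not describe. It assumes a deliberately simple cost model: a request's cost on a GPU is that GPU's queued prefill tokens plus the prompt tokens that are not covered by a cached prefix. All names (`Gpu`, `pick_gpu`, `dispatch`, `queued_tokens`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Count the leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Gpu:
    name: str
    # Prompts whose KV state is assumed to be cached on this GPU (hypothetical model).
    cached_prompts: list[list[int]] = field(default_factory=list)
    # Pending prefill work in tokens; a crude stand-in for load.
    queued_tokens: int = 0

    def best_match(self, prompt: list[int]) -> int:
        """Longest prefix of `prompt` that is already cached here."""
        return max(
            (shared_prefix_len(prompt, c) for c in self.cached_prompts),
            default=0,
        )


def pick_gpu(gpus: list[Gpu], prompt: list[int]) -> Gpu:
    """Pick the GPU with the lowest toy cost: queued work plus tokens to recompute."""
    def cost(g: Gpu) -> int:
        return g.queued_tokens + (len(prompt) - g.best_match(prompt))
    return min(gpus, key=cost)


def dispatch(gpus: list[Gpu], prompt: list[int]) -> Gpu:
    """Assign the prompt to a GPU and update that GPU's load and cache."""
    g = pick_gpu(gpus, prompt)
    g.queued_tokens += len(prompt) - g.best_match(prompt)
    g.cached_prompts.append(prompt)
    return g
```

With an idle GPU that already caches a long matching prefix, the sketch favors reuse. Once that GPU's queue grows beyond the recomputation savings, the same request goes to another GPU instead. A purely cache-greedy or purely load-greedy policy would miss this trade-off.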

Paper Structure

This paper contains 24 sections, 12 figures, 1 table, 3 algorithms.

Figures (12)

  • Figure 1: Prompt Sharing Features of Five Workloads. Green boxes represent shared prefixes. Grey boxes are non-shared prompts. White boxes are output generation. Red boxes contain statistics in average values: "prompt-length:output-length; shared token percentage; number of requests sharing a sequence".
  • Figure 2: Preble Architecture.
  • Figure 3: End-to-end Workload Performance. The top and middle two rows run on two and four A6000 GPUs with the Mistral 7B model. The bottom two rows run on eight H100 GPUs set up as 4-GPU tensor parallelism plus data parallelism with the Llama-3 70B model.
  • Figure 4: Mixed Workload With Azure Trace. Running Tool and Video mixed workloads with Azure trace arrival patterns on 4 A6000 GPUs.
  • Figure 5: Ablation Results. Running ToolBench with Zipf-1.1 skew to different prompts running on four A6000 GPUs.
  • ...and 7 more figures