Fairness in Serving Large Language Models

Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica

TL;DR

This work defines fairness in LLM serving as token-level resource sharing, measured by a cost function over input and output tokens, and introduces the Virtual Token Counter (VTC), a scheduler built on continuous batching. The authors prove a tight 2x upper bound on the service difference between any two backlogged clients while keeping the scheduler work-conserving. In experiments on synthetic and real workloads, VTC maintains fairness where the FCFS (first-come, first-served) and RPM (requests-per-minute rate limit) baselines fall short, and it supports weighted, length-prediction, and profiled-cost variants. Code to reproduce the results is publicly available.

Abstract

High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests from short chat conversations to long document reading. To ensure that all client requests are processed fairly, most major LLM inference services have request rate limits, to ensure that no client can dominate the request queue. However, this rudimentary notion of fairness also results in under-utilization of the resources and poor client experience when there is spare capacity. While there is a rich literature on fair scheduling, serving LLMs presents new challenges due to their unpredictable request lengths and their unique batching characteristics on parallel accelerators. This paper introduces the definition of LLM serving fairness based on a cost function that accounts for the number of input and output tokens processed. To achieve fairness in serving, we propose a novel scheduling algorithm, the Virtual Token Counter (VTC), a fair scheduler based on the continuous batching mechanism. We prove a 2x tight upper bound on the service difference between two backlogged clients, adhering to the requirement of work-conserving. Through extensive experiments, we demonstrate the superior performance of VTC in ensuring fairness, especially in contrast to other baseline methods, which exhibit shortcomings under various conditions. The reproducible code is available at https://github.com/Ying1123/VTC-artifact
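The cost function mentioned in the abstract can be illustrated with a minimal sketch. It assumes a linear cost in input and output token counts; the class name `TokenCost`, the weight values, and the example request sizes are illustrative assumptions, not values taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenCost:
    """Linear token cost. The default weights are illustrative assumptions."""

    w_input: float = 1.0   # assumed weight per input (prompt) token
    w_output: float = 2.0  # assumed weight per output (generated) token

    def service(self, n_input: int, n_output: int) -> float:
        """Service charged to a client for a single request."""
        return self.w_input * n_input + self.w_output * n_output


cost = TokenCost()
short_chat = cost.service(n_input=32, n_output=64)     # short conversation turn
long_doc = cost.service(n_input=4096, n_output=256)    # long document reading
print(short_chat, long_doc)
```

Under these assumed weights the long-document request is charged far more service than the short chat turn, although a request rate limit would count each of them as one request.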

Paper Structure

This paper contains 53 sections, 12 theorems, 29 equations, 20 figures, 6 tables, and 4 algorithms.

Key Result

Lemma 4.2

The following invariant holds at any time in the VTC algorithm (alg:vtc) when $Q \neq \emptyset$:

Figures (20)

  • Figure 1: Serving architecture with Virtual Token Counter (VTC), illustrated with two clients. VTC maintains a queue of requests and keeps track of the tokens served for each client. In each iteration of the LLM execution engine, tokens are generated for some of the clients, and the counters of those clients are updated accordingly. When the condition for adding new requests is satisfied (e.g., memory is released when other requests finish), VTC is invoked to choose the requests to add. VTC achieves fairness by prioritizing the clients with the lowest counters and by carefully handling clients that leave and rejoin (see the VTC section, sec:vtc).
  • Figure 2: An illustration of how request length can affect the cost and server capacity in terms of throughput. The visualized length is not precise but for illustration purposes only.
  • Figure 3: Two clients with different request rates and both overloaded. Client 1 sends 90 requests per minute. Client 2 sends 180 requests per minute, both evenly spaced out so that each request is sent at a consistent time interval throughout the minute. Every request has input lengths of 256 and output lengths of 256. Both clients are backlogged because they exceed the server capacity.
  • Figure 4: Client 3, which is overloaded, can consume more than its share because Clients 1 and 2 send requests at rates below their share. Clients 1, 2, and 3 send 15, 30, and 90 requests per minute, respectively, under a uniform distribution. Requests have input lengths of 256 and output lengths of 256. Client 3 is backlogged, while Clients 1 and 2 are not.
  • Figure 5: ON/OFF request pattern. Client 1 sends 30 requests per minute (less than half of the capacity) during the ON phase and switches to OFF phase periodically. Client 2 is always in the ON phase, sending requests at a rate of 120 requests per minute (larger than half of the capacity). Requests have input lengths of 256 and output lengths of 256.
  • ...and 15 more figures
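The mechanism described in the Figure 1 caption (per-client counters, admission of the client with the lowest counter, and special handling when a client rejoins) can be sketched as follows. This is a simplified illustration and not the authors' implementation: the class name, method names, weights, and the exact rejoin rule (lifting a returning client's counter to the smallest counter among clients with queued requests) are assumptions made for this sketch.

```python
from collections import defaultdict, deque


class VirtualTokenCounterSketch:
    """Simplified per-client token counters (illustrative, not the paper's code)."""

    def __init__(self, w_input=1.0, w_output=2.0):
        self.w_input = w_input    # assumed weight per input token
        self.w_output = w_output  # assumed weight per output token
        self.counters = defaultdict(float)  # client -> service received so far
        self.queues = defaultdict(deque)    # client -> queued input lengths

    def _backlogged(self):
        """Clients that currently have at least one queued request."""
        return [c for c, q in self.queues.items() if q]

    def submit(self, client, n_input):
        active = self._backlogged()
        if client not in active and active:
            # Assumed rejoin rule: a returning client's counter is lifted to
            # the smallest counter among backlogged clients, so idle time
            # does not accumulate credit.
            floor = min(self.counters[c] for c in active)
            self.counters[client] = max(self.counters[client], floor)
        self.queues[client].append(n_input)

    def admit_next(self):
        """Admit one request from the backlogged client with the lowest counter."""
        active = self._backlogged()
        if not active:
            return None
        client = min(active, key=lambda c: self.counters[c])
        n_input = self.queues[client].popleft()
        self.counters[client] += self.w_input * n_input
        return client, n_input

    def on_tokens_generated(self, client, n_output=1):
        """Called after an engine iteration that generated tokens for `client`."""
        self.counters[client] += self.w_output * n_output


vtc = VirtualTokenCounterSketch()
for _ in range(4):
    vtc.submit("A", 256)  # client A floods the queue first
for _ in range(2):
    vtc.submit("B", 256)  # client B arrives later with fewer requests
order = [vtc.admit_next()[0] for _ in range(4)]
print(order)  # ['A', 'B', 'A', 'B']
```

Although client A's requests arrived first, admissions alternate between A and B because each admission raises the admitted client's counter. A first-come, first-served queue would have admitted all four of A's requests before any of B's.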

Theorems & Definitions (26)

  • Definition 4.1: Backlog
  • Definition 4.2: Fairness (adapted from stf)
  • Lemma 4.2
  • Theorem 4.3: Fairness for overloaded clients
  • proof
  • Remark 4.4
  • Remark 4.5
  • Remark 4.6
  • Theorem 4.7
  • Theorem 4.8
  • ...and 16 more