Ensuring Fair LLM Serving Amid Diverse Applications

Redwan Ibne Seraj Khan, Kunal Jain, Haiying Shen, Ankur Mallick, Anjaly Parayil, Anoop Kulkarni, Steve Kofsky, Pankhuri Choudhary, Renèe St. Amant, Rujia Wang, Yue Cheng, Ali R. Butt, Victor Rühle, Chetan Bansal, Saravan Rajmohan

TL;DR

FairServe combines application-characteristic-aware request throttling with weighted-service-counter-based scheduling to curb abusive behavior and ensure fairness. Experiments on real-world traces show that it ensures fairness better than the state-of-the-art method.

Abstract

In a multi-tenant large language model (LLM) serving platform hosting diverse applications, some users may submit an excessive number of requests, causing the service to become unavailable to other users and creating unfairness. Existing fairness approaches do not account for variations across applications in token lengths and in the number of LLM calls, making them unsuitable for such platforms. To address the fairness challenge, this paper analyzes millions of requests from thousands of users on MS CoPilot, a real-world multi-tenant LLM platform hosted by Microsoft. Our analysis confirms the inadequacy of existing methods and guides the development of FairServe, a system that ensures fair LLM access across diverse applications. FairServe combines application-characteristic-aware request throttling with weighted-service-counter-based scheduling to curb abusive behavior and ensure fairness. Our experimental results on real-world traces demonstrate that FairServe ensures fairness better than the state-of-the-art method. We are actively working on deploying our system in production, where we expect it to benefit millions of customers worldwide.

Paper Structure

This paper contains 19 sections, 3 equations, 9 figures, 4 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Example of an LLM interaction.
  • Figure 2: Users' varying RPM and TPM across apps make prior policies ineffective for ensuring fairness and deterring abusive behavior.
  • Figure 3: Token counts differ across applications, suggesting that LLM scheduling must consider variations in user applications.
  • Figure 4: The variability of LLM calls across apps must be considered to reduce latencies and queueing delays in multi-agent apps.
  • Figure 5: System prompts lead to varying characteristics in total tokens of each interaction and the number of output tokens.
  • ...and 4 more figures