FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees

Gabriele Oliaro, Xupeng Miao, Xinhao Cheng, Vineeth Kada, Mengdi Wu, Ruohan Gao, Yingyi Huang, Remi Delacourt, April Yang, Yingcheng Wang, Colin Unger, Zhihao Jia

TL;DR

FlexLLM tackles the inefficiency of running inference and finetuning on separate, dedicated GPU clusters by introducing token-level co-serving of LLM inference and PEFT finetuning on shared GPUs. Its core contributions are static compilation optimizations (dependent parallelization and graph pruning) that shrink activation memory, a token-level finetuning mechanism, and a hybrid token scheduler that maintains inference SLOs while maximizing finetuning throughput. The system achieves end-to-end GPU memory savings of up to 80%, maintains inference SLO compliance at up to 20 req/s, and improves finetuning throughput by $1.9-4.8\times$ under heavy inference workloads and $2.5-6.8\times$ under light loads. The work demonstrates practical impact by enabling tighter hardware utilization, reduced energy costs, and scalable PEFT-based customization on modern LLMs.

Abstract

Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters -- wasting resources and under-utilizing hardware. We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing computation at the token level. FlexLLM's static compilation optimizations -- dependent parallelization and graph pruning -- significantly shrink activation memory, leading to end-to-end GPU memory savings of up to 80%. At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration, meeting strict latency SLOs while maximizing utilization. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM maintains inference SLO compliance at up to 20 req/s, and improves finetuning throughput by $1.9-4.8\times$ under heavy inference workloads and $2.5-6.8\times$ under light loads, preserving over 76% of peak finetuning progress even at peak demand. FlexLLM is publicly available at https://flexllm.github.io.
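
The interleaving of inference and training tokens within one co-serving iteration can be illustrated with a minimal sketch. This is not FlexLLM's scheduler: the function `plan_iteration`, the `Iteration` record, and the `token_budget` parameter are hypothetical names introduced here, and the sketch assumes that a fixed per-iteration token budget is a reasonable stand-in for "the most tokens an iteration can process while still meeting the inference latency SLO". Under that assumption, inference tokens are admitted first and finetuning tokens fill whatever slack remains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Iteration:
    """Token counts packed into one co-serving iteration (hypothetical record)."""
    inference_tokens: int
    finetuning_tokens: int


def plan_iteration(pending_inference_tokens: int,
                   pending_finetuning_tokens: int,
                   token_budget: int) -> Iteration:
    """Pack one iteration: inference tokens first, finetuning tokens in the slack.

    token_budget is an assumed, externally supplied cap on tokens per
    iteration that keeps latency within the SLO; how such a cap would be
    derived is outside the scope of this sketch.
    """
    if min(pending_inference_tokens, pending_finetuning_tokens, token_budget) < 0:
        raise ValueError("token counts and budget must be non-negative")
    # Latency-critical inference tokens are admitted before anything else.
    inference = min(pending_inference_tokens, token_budget)
    # Finetuning tokens use only the budget left over after inference.
    finetuning = min(pending_finetuning_tokens, token_budget - inference)
    return Iteration(inference, finetuning)


if __name__ == "__main__":
    # Light inference load: most of the budget goes to finetuning.
    print(plan_iteration(100, 1000, 256))
    # Heavy inference load: finetuning yields entirely for this iteration.
    print(plan_iteration(300, 1000, 256))
```

The design choice the sketch highlights is that sharing happens per token rather than per GPU or per time slice, so finetuning progress shrinks smoothly as inference load rises instead of being preempted wholesale.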

Paper Structure (41 sections, 3 theorems, 2 equations, 14 figures, 2 tables, 4 algorithms)

Key Result

Lemma 1

When $Q \neq \emptyset$,

Figures (14)

  • Figure 1: Comparing different resource sharing approaches for serving finetuning and inference. For spatial sharing and co-serving, the height of rounded rectangles illustrates the splitting ratio of GPU resources (e.g., streaming multiprocessors).
  • Figure 2: An overview of FlexLLM.
  • Figure 3: Four possible parallel states for a tensor dimension and their transitions. For each parallel state, the symbol in parentheses shows the notation FlexLLM uses to represent it.
  • Figure 4: Illustration of FlexLLM's different dependent parallelization strategies with the LoRA example. Each green (or gray) box indicates a compute (or parallelization) operator, and each edge between operators represents a parallel tensor. The parallelization states of the tensor's dimensions are shown next to the edge.
  • Figure 5: Static graph pruning for an MLP model with LoRA.
  • ...and 9 more figures

Theorems & Definitions (3)

  • Lemma 1
  • Theorem 1: Fairness for overloaded tenants
  • Theorem 2: Fairness for non-overloaded tenants