Harli: SLO-Aware Co-location of LLM Inference and PEFT-based Finetuning on Model-as-a-Service Platforms

Ao Xu, Han Zhao, Weihao Cui, Quan Chen, Yukang Chen, Shulai Zhang, Shuang Chen, Jiemin Jiang, Zhibin Yu, Minyi Guo

TL;DR

Harli tackles underutilization in MaaS LLM serving by co-locating PEFT-based finetuning with memory-bound decode tasks on the same GPU. It introduces a unified memory allocator to reuse KV-cache memory for finetune work, a two-stage latency predictor to model solo and co-run latency, and a QoS-aware scheduler that dynamically partitions SMs to meet inference targets while maximizing finetune throughput. The system demonstrates substantial finetune throughput gains (average ~46%, up to ~92%) across LLaMA3-8B and Qwen2.5-7B with strict decode QoS, and shows robustness to tensor-parallel configurations. This work offers a practical, deployable approach to improve GPU utilization in MaaS LLM services without compromising latency guarantees, with clear implications for deployment on diverse GPU families and models.

Abstract

Large language models (LLMs) are increasingly deployed under the Model-as-a-Service (MaaS) paradigm. To meet stringent quality-of-service (QoS) requirements, existing LLM serving systems disaggregate the prefill and decode phases of inference. However, decode instances often experience low GPU utilization due to their memory-bound nature and insufficient batching in dynamic workloads, leaving compute resources underutilized. We introduce Harli, a serving system that improves GPU utilization by co-locating parameter-efficient finetuning (PEFT) tasks with LLM decode instances. PEFT tasks are compute-bound and memory-efficient, making them ideal candidates for safe co-location. Specifically, Harli addresses the key challenges of limited memory and unpredictable interference with three components: a unified memory allocator for runtime memory reuse, a two-stage latency predictor for decode latency modeling, and a QoS-guaranteed scheduler for throughput maximization. Experimental results show that Harli improves finetune throughput by 46.2% on average (up to 92.0%) over state-of-the-art serving systems, while maintaining strict QoS guarantees for inference decode.
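The scheduling idea in the abstract (give decode only as many SMs as its latency target needs, and hand the rest to finetuning) can be illustrated with a minimal sketch. This is not Harli's implementation: the function names, the candidate SM percentages, and the toy latency model below are all hypothetical stand-ins, and the toy model is far simpler than the two-stage predictor the paper describes.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical candidate SM percentages that decode may be granted.
CANDIDATE_SM_PERCENTS: Sequence[int] = (10, 20, 30, 40, 50, 60, 70, 80, 90, 100)


def toy_decode_latency_ms(batch_size: int, sm_percent: int) -> float:
    """Toy stand-in for a latency predictor (an assumption, not the paper's model).

    A fixed memory-bound floor plus a compute term that shrinks as decode
    receives a larger share of the SMs.
    """
    memory_floor_ms = 20.0
    return memory_floor_ms + 25.0 * batch_size / sm_percent


def pick_decode_sm_percent(
    predict_ms: Callable[[int, int], float],
    batch_size: int,
    slo_ms: float,
) -> Optional[int]:
    """Return the smallest SM percentage whose predicted decode latency meets the SLO."""
    for percent in sorted(CANDIDATE_SM_PERCENTS):
        if predict_ms(batch_size, percent) <= slo_ms:
            return percent
    return None  # No candidate partition satisfies the SLO.


def partition_sms(
    predict_ms: Callable[[int, int], float],
    batch_size: int,
    slo_ms: float,
) -> Tuple[int, int]:
    """Return (decode_percent, finetune_percent).

    If no partition meets the SLO, decode keeps the whole GPU and
    finetuning is paused for this step.
    """
    decode_percent = pick_decode_sm_percent(predict_ms, batch_size, slo_ms)
    if decode_percent is None:
        return 100, 0
    return decode_percent, 100 - decode_percent


if __name__ == "__main__":
    print(partition_sms(toy_decode_latency_ms, batch_size=8, slo_ms=32.0))
    print(partition_sms(toy_decode_latency_ms, batch_size=64, slo_ms=32.0))
```

Under this toy model, a small decode batch leaves most SMs free for finetuning, while a large batch that cannot meet the target even with the full GPU pauses finetuning entirely. A real scheduler would also need to account for co-run interference, which is the part the toy predictor ignores.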

Paper Structure

This paper contains 43 sections, 5 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Throughput of prefill and decode phase with varied batch sizes and sequence lengths.
  • Figure 2: The general architecture of an LLM model.
  • Figure 3: The decode batch size of inference tasks under a real-world trace (Splitwise).
  • Figure 4: The DRAM bandwidth and SM utilization of the decode phase under different configurations.
  • Figure 5: The throughput improvement of finetune tasks under different configurations.
  • ...and 9 more figures