PromptTuner: SLO-Aware Elastic System for LLM Prompt Tuning

Wei Gao, Peng Sun, Dmitrii Ustiugov, Tianwei Zhang, Yonggang Wen

TL;DR

This paper introduces PromptTuner, an SLO-aware elastic system for optimizing LLM prompt tuning, and develops a Workload Scheduler that allocates resources quickly to reduce SLO violations and resource costs.

Abstract

Prompt tuning has become a prominent strategy for enhancing the performance of Large Language Models (LLMs) on downstream tasks. Many IT enterprises now offer Prompt-Tuning-as-a-Service to meet the growing demand for prompt tuning LLMs on downstream tasks. Their primary objective is to satisfy users' Service Level Objectives (SLOs) while reducing resource provisioning costs. Nevertheless, our characterization analysis of existing deep learning resource management systems reveals that they are insufficient to optimize these objectives for LLM prompt tuning workloads. In this paper, we introduce PromptTuner, an SLO-aware elastic system to optimize LLM prompt tuning. It contains two innovations. (1) We design a Prompt Bank to identify efficient initial prompts to expedite the convergence of prompt tuning. (2) We develop a Workload Scheduler to enable fast resource allocation to reduce SLO violations and resource costs. In our evaluation, PromptTuner reduces SLO violations by 4.0x and 7.9x, and lowers costs by 1.6x and 4.5x, compared to INFless and ElasticFlow, respectively.

Paper Structure (29 sections, 1 equation, 10 figures, 8 tables, 2 algorithms)

Figures (10)

  • Figure 1: An example of LLM prompt tuning. The user first prepares the LLM, the initial prompt, and the task-specific dataset, which consists of a batch of input queries and target responses. During the execution stage, it optimizes the tunable prompt starting from the initial prompt on the given dataset.
  • Figure 2: Characteristics of LPT workloads: (a) The end-to-end LPT job execution time breakdown across different LLMs. (b) A 2-hour LPT workload trace from a cluster. (c) The Iteration-To-Accuracy (ITA) distribution of various initial prompts with the SAMSUM dataset across different LLMs.
  • Figure 3: Characterization of existing DL systems: (a) The cluster utilization (%) ($y$-axis) in ElasticFlow over time ($x$-axis). (b) The CDF ($y$-axis) illustrates the fraction ($x$-axis) of waiting delay in the end-to-end latency caused by the instance initialization. (c) SLO violation (%) of ElasticFlow and INFless across varying maximum GPUs.
  • Figure 4: The workflow of PromptTuner. It consists of two key components: (1) The Prompt Bank identifies an effective initial prompt for an incoming LPT job at a minimal cost; (2) The Workload Scheduler dynamically adds GPUs from the GPU pool for each LPT job to reduce SLO violation while minimizing resource costs.
  • Figure 5: The illustration of performing (a) lookup, and (b) insertion & replacement on the two-layer data structure.
  • ...and 5 more figures
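
The Figure 4 caption describes a Workload Scheduler that adds GPUs to each LPT job to reduce SLO violations while minimizing resource costs. As a rough illustration of that trade-off only, and not the paper's algorithm, the sketch below picks the smallest GPU count whose estimated finish time meets a job's deadline. The function name, its parameters, and the linear-with-efficiency speedup model are all assumptions made for this example.

```python
def min_gpus_for_slo(remaining_iters, secs_per_iter_one_gpu, secs_to_deadline,
                     max_gpus, scaling_efficiency=0.9):
    """Hypothetical helper: smallest GPU count that meets the deadline, else None.

    Assumed throughput model (not taken from the paper):
        speedup(g) = 1 + scaling_efficiency * (g - 1)
    """
    for gpus in range(1, max_gpus + 1):
        speedup = 1 + scaling_efficiency * (gpus - 1)
        estimated_secs = remaining_iters * secs_per_iter_one_gpu / speedup
        if estimated_secs <= secs_to_deadline:
            # Fewest GPUs that satisfy the SLO, which keeps cost low.
            return gpus
    # Even max_gpus cannot meet the deadline under this model.
    return None


if __name__ == "__main__":
    # 1000 iterations left at 1 s each on one GPU, 500 s until the deadline.
    print(min_gpus_for_slo(1000, 1.0, 500, max_gpus=8))
```

Returning the first feasible count reflects the cost side of the objective: any GPU beyond the minimum that meets the deadline adds provisioning cost without improving SLO attainment under this simplified model.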