A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving

Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral

TL;DR

This work tackles maximizing throughput in single-GPU LLM-adapter serving under heterogeneous adapter properties by framing an adapter caching problem. It couples a data-driven ML pipeline with interpretable models to predict the joint configuration of concurrent adapters and adapter slots, and introduces the first Digital Twin of an LLM-adapter serving system to generate training data offline. The DT achieves throughput predictions within $5.1\%$ of real results, while ML models, especially tree-based ones, predict near-optimal configurations with maximal errors as low as $1.0\%$ for concurrent adapters and $7.2\%$ for adapter slots, with fast inference around $0.15$ ms. The approach yields interpretable rules for production deployment and demonstrates potential for scalable, low-latency adapter serving on a single GPU, with future work extending to multi-device scenarios and broader frameworks.

Abstract

With the rapid adoption of Large Language Models (LLMs), LLM-adapters have become increasingly common, providing lightweight specialization of large-scale models. Serving hundreds or thousands of these adapters on a single GPU allows request aggregation, increasing throughput, but may also cause request starvation if GPU memory limits are exceeded. To address this issue, this study focuses on determining the joint configuration of concurrent and parallel adapters that maximizes GPU throughput without inducing starvation, given heterogeneous adapter and traffic properties. We propose a data-driven ML approach leveraging interpretable models to tackle this caching problem and introduce the first Digital Twin capable of reproducing an LLM-adapter serving system, enabling efficient training data generation. Experiments with the vLLM framework and LoRA adapters show that the Digital Twin reproduces throughput within 5.1% of real results, while the ML approach predicts optimal numbers of concurrent and parallel adapters with an error of at most 7.2% under heterogeneous, real-world workloads. The code is publicly available at https://github.com/FerranAgulloLopez/GPULLMAdapterOptimization.

Paper Structure

This paper contains 19 sections, 1 equation, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Throughput evolution with the number of concurrent adapters when varying adapter sizes, adapter request rates, request output lengths, and adapter slots. The results are shown for Llama-2-7B [touvron2023llama2openfoundation] and a public adapter, llama-2-7b-sql-lora-test, using default values of rate 0.05 reqs/s, adapter size 8, 250 input tokens, and 231 output tokens. The crosses at each curve's maximum denote the optimum targeted by the adapter caching problem.
  • Figure 2: Maximum throughput (left) and batch size (right) evolution as the number of loaded adapters increases, shown for both models, varying adapter sizes/ranks and two datasets. Crosses indicate when no space is left for loading more adapters.
  • Figure 3: Maximum throughput (left) and (right) as the number of unique adapters in the batch increases, shown for Llama-2-7B, varying adapter sizes/ranks, and two datasets. Lines terminate at the point where the batch size can no longer be increased.
  • Figure 4: Loading times for varying adapter sizes and storage types, shown relative to request latency across the three datasets for Llama-2-7B. Request latency is computed as $\mathrm{TPOT} \times (\mathrm{output\_tokens} - 1)$, where TPOT is the time per output token.
  • Figure 5: Optimal concurrent adapters when working with S-LoRA (marked with crosses) with varying adapter rates in Llama-2-7B, with rank 32, 250 input tokens, and 231 output tokens.
  • ...and 3 more figures
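
The request-latency definition in the Figure 4 caption can be illustrated with a minimal sketch. The function names, the TPOT value, and the loading time below are hypothetical and chosen only for illustration; the 231 output tokens match the default stated for Figure 1. Reading "relative to request latency" as a plain ratio is also an assumption, not something the caption confirms.

```python
def request_latency(tpot_s: float, output_tokens: int) -> float:
    """Request latency per the Figure 4 caption: TPOT * (output_tokens - 1)."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return tpot_s * (output_tokens - 1)


def relative_loading_time(loading_time_s: float, tpot_s: float, output_tokens: int) -> float:
    """Adapter loading time as a fraction of request latency (assumed reading)."""
    latency = request_latency(tpot_s, output_tokens)
    if latency == 0:
        raise ValueError("request latency is zero; the ratio is undefined")
    return loading_time_s / latency


# Hypothetical inputs: 20 ms per output token, 50 ms to load one adapter.
latency = request_latency(tpot_s=0.02, output_tokens=231)
ratio = relative_loading_time(loading_time_s=0.05, tpot_s=0.02, output_tokens=231)
print(f"request latency: {latency:.2f} s, loading/latency ratio: {ratio:.4f}")
```

Under this reading, a small ratio means that loading an adapter costs little compared with serving a single request, while a ratio near or above 1 means loading dominates.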