Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving

Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral

TL;DR

A data-driven pipeline that, for a given workload, computes an adapter placement that serves the workload with the minimum number of GPUs while avoiding request starvation and GPU memory errors. The pipeline can also be adapted to alternative objectives, such as latency minimization.

Abstract

Large Language Model (LLM) adapters enable low-cost model specialization, but introduce complex caching and scheduling challenges in distributed serving systems where hundreds of adapters must be hosted concurrently. While prior work has largely focused on latency minimization, resource efficiency through throughput maximization remains underexplored. This paper presents a data-driven pipeline that, for a given workload, computes an adapter placement that serves the workload with the minimum number of GPUs while avoiding request starvation and GPU memory errors. To that end, the approach identifies the maximum feasible throughput attainable on each GPU by leveraging accurate performance predictions learned from real serving behavior. The proposed pipeline integrates three components: (i) a Digital Twin (DT) tailored to LLM-adapter serving, (ii) a distilled machine learning (ML) model trained on DT-generated data, and (iii) a greedy placement algorithm that exploits ML-based performance estimates to maximize GPU efficiency. The DT emulates real system dynamics with high fidelity, achieving below 5% throughput estimation error while executing up to 90 times faster than full LLM benchmarking across both predictable and unpredictable workloads. The learned ML models further accelerate performance estimation with marginal accuracy degradation, enabling scalable optimization. Experimental results demonstrate that the pipeline substantially improves GPU efficiency by reducing the number of GPUs required to sustain target workloads. Beyond GPU efficiency, the pipeline can be adapted to alternative objectives, such as latency minimization, highlighting its versatility for future large-scale LLM serving infrastructures.
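
The abstract does not spell out the greedy placement step, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes a first-fit-decreasing strategy and a feasibility predicate standing in for the learned performance model. The names `Adapter`, `toy_is_feasible`, `greedy_placement`, and the thresholds `max_rate` and `max_rank_sum` are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Adapter:
    name: str
    rank: int      # adapter size (LoRA rank)
    rate: float    # per-adapter arrival rate in req/s


def toy_is_feasible(adapters: List[Adapter],
                    max_rate: float = 1.0,
                    max_rank_sum: int = 64) -> bool:
    """Hypothetical stand-in for the learned performance model.

    A real predictor would estimate whether one GPU can serve this set of
    adapters without request starvation or memory errors. Here we simply
    cap the aggregate arrival rate and the summed ranks (assumed limits).
    """
    total_rate = sum(a.rate for a in adapters)
    total_rank = sum(a.rank for a in adapters)
    return total_rate <= max_rate and total_rank <= max_rank_sum


def greedy_placement(adapters: List[Adapter],
                     is_feasible: Callable[[List[Adapter]], bool]
                     ) -> List[List[Adapter]]:
    """First-fit-decreasing placement: one inner list per GPU."""
    gpus: List[List[Adapter]] = []
    # Place the most demanding adapters first.
    for adapter in sorted(adapters, key=lambda a: a.rate, reverse=True):
        for gpu in gpus:
            # Reuse the first GPU that the predictor says can absorb it.
            if is_feasible(gpu + [adapter]):
                gpu.append(adapter)
                break
        else:
            # No existing GPU fits, so open a new one.
            if not is_feasible([adapter]):
                raise ValueError(f"{adapter.name} does not fit on an empty GPU")
            gpus.append([adapter])
    return gpus


if __name__ == "__main__":
    workload = [
        Adapter("a", 8, 0.5),
        Adapter("b", 8, 0.5),
        Adapter("c", 8, 0.25),
        Adapter("d", 8, 0.25),
        Adapter("e", 8, 0.75),
    ]
    for i, gpu in enumerate(greedy_placement(workload, toy_is_feasible)):
        print(f"GPU {i}: {[a.name for a in gpu]}")
```

Passing the predictor as a callable keeps the placement loop independent of how feasibility is estimated, so a slower emulator or a faster learned model could in principle be swapped in without changing the loop.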

Paper Structure

This paper contains 35 sections, 1 equation, 14 figures, 5 tables, 2 algorithms.

Figures (14)

  • Figure 1: Throughput as a function of the number of served adapters under varying adapter sizes (left), arrival rates (center), and configured $A_{max}$ (right). Each line corresponds to experiments in which all parameters are held constant except for the number of adapters. Results were obtained using vLLM with Llama-2-7B [touvron2023llama2openfoundation] and a public LoRA adapter [llama27bsqlloratest]. Unless otherwise specified, the default configuration is: adapter size (rank) 8, per-adapter arrival rate 0.05 req/s, 250 input tokens per request, and 231 output tokens per request. In the two leftmost plots, $A_{max}$ is set equal to the number of served adapters. $S_{max}$ is configured to match the adapter size used in each experiment.
  • Figure 2: Proposed data-driven pipeline to address the adapter caching problem (right), shown alongside its expected usage within a production system (left). The numbered red markers indicate the recommended reading order of the workflow.
  • Figure 3: Digital Twin behavior and architecture.
  • Figure 4: Evolution of batch size and throughput with increasing numbers of loaded adapters (left, center) and ITL versus batch size (right), across models and adapter sizes. Crosses denote the point where GPU memory is exhausted. Measurements were obtained by oversaturating a single-GPU system and issuing backbone-only requests to isolate the memory overhead of adapter weights. Minor variations across adapter sizes are nevertheless observed, likely due to additional operations triggered in vLLM when adapters are activated, even if unused.
  • Figure 5: Throughput slowdown and overhead for an increasing number of adapters across three adapter sizes. Results are shown for Llama-2-7B, relative to backbone-only execution. To avoid the impact of adapter weights, we fix the batch size and number of loaded adapters within each line. Lines are shorter for larger adapter sizes due to their reduced achievable batch size, which limits the maximum number of runnable adapters.
  • ...and 9 more figures