LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services

Małgorzata Łazuka; Andreea Anghel; Thomas Parnell

TL;DR

LLM-Pilot is a first-of-its-kind system for characterizing and predicting the performance of LLM inference services. It learns a predictive model that can be used to recommend the most cost-effective hardware for a previously unseen LLM.

Abstract

As Large Language Models (LLMs) rapidly grow in popularity, LLM inference services must be able to serve requests from thousands of users while satisfying performance requirements. The performance of an LLM inference service is largely determined by the hardware on which it is deployed, but understanding which hardware will meet those requirements remains challenging. In this work we present LLM-Pilot, a first-of-its-kind system for characterizing and predicting the performance of LLM inference services. LLM-Pilot benchmarks LLM inference services under a realistic workload across a variety of GPUs, and optimizes the service configuration for each considered GPU to maximize performance. Finally, using this characterization data, LLM-Pilot learns a predictive model that can be used to recommend the most cost-effective hardware for a previously unseen LLM. Compared to existing methods, LLM-Pilot can deliver on performance requirements 33% more frequently, whilst reducing costs by 60% on average.
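The recommendation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a predictive model has already produced latency estimates for each candidate GPU, and it simply picks the cheapest candidate that satisfies the latency requirements. The names `GpuOption` and `recommend_gpu`, the GPU labels, the prices, and the latency figures are all hypothetical; TTFT and ITL are used as the latency metrics because they appear in the paper's figure captions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GpuOption:
    """One candidate deployment target (all values here are illustrative)."""
    name: str                 # hypothetical GPU label
    hourly_cost: float        # assumed cost per hour, arbitrary currency
    predicted_ttft_ms: float  # predicted time to first token, in ms
    predicted_itl_ms: float   # predicted inter-token latency, in ms


def recommend_gpu(
    options: Sequence[GpuOption],
    max_ttft_ms: float,
    max_itl_ms: float,
) -> Optional[GpuOption]:
    """Return the cheapest option whose predicted latencies meet both limits.

    Returns None when no candidate satisfies the requirements.
    """
    feasible = [
        o for o in options
        if o.predicted_ttft_ms <= max_ttft_ms and o.predicted_itl_ms <= max_itl_ms
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda o: o.hourly_cost)


if __name__ == "__main__":
    # Made-up predictions for three hypothetical GPU types.
    candidates = [
        GpuOption("gpu-small", 1.0, 900.0, 80.0),
        GpuOption("gpu-medium", 2.5, 400.0, 40.0),
        GpuOption("gpu-large", 6.0, 150.0, 20.0),
    ]
    choice = recommend_gpu(candidates, max_ttft_ms=500.0, max_itl_ms=50.0)
    print(choice.name if choice else "no GPU meets the requirements")
```

In this sketch the quality of the recommendation depends entirely on the accuracy of the predicted latencies, which is why the abstract emphasizes collecting characterization data under a realistic workload before training the predictor.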

Paper Structure (32 sections, 8 equations, 8 figures, 4 tables)

Introduction
Background
LLM inference performance requirements
LLM inference servers
Deploying LLM inference services
Performance Characterization Tool
Analysis of production traces
Workload generator
Modelling the requests
Sampling requests
Performance data collection
Deployment
Tuning the batch weight
Load testing
Other considerations
...and 17 more sections

Figures (8)

Figure 1: Median end-to-end latency achieved by the inference service of a selected LLM (bigcode/starcoder) deployed on one A100 GPU with varying maximum batch weight, for 128 concurrent users.
Figure 2: Architecture of the performance characterization tool.
Figure 3: Correlation between selected parameters of requests from the production traces.
Figure 4: The MDI importance scores of the number of CPU cores, amount of memory, maximum batch weight and number of concurrent users, determined by an RF predicting the TTFT and ITL latency for a selected LLM (bigcode/starcoder).
Figure 5: Architecture of the GPU recommendation tool.
...and 3 more figures
