
The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines

Matias Martinez

TL;DR

This paper systematically evaluates throughput for 20 open-source LLMs using two inference engines, vLLM and HuggingFace pipelines (HF_pl), to understand how hyperparameters shape inference performance. It shows that throughput landscapes are irregular, with distinct peaks, which underscores the need for hyperparameter optimization, especially when the hardware changes. The author introduces InfPop (Hyperparameter Optimization) and shows that it yields meaningful throughput gains for HF_pl when the GPU is upgraded or downgraded (averages of 9.16% and 13.7%, respectively), while vLLM typically sees smaller gains. Overall, the work highlights the practical importance of tuning hyperparameters to maximize inference throughput in production, and it provides a framework for evaluating and optimizing LLM inference across engines and hardware.

Abstract

The recent surge of open-source large language models (LLMs) enables developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance, thereby providing governance and ownership of the model deployment process. To use these LLMs, inference engines are needed. These engines load the model's weights onto available resources, such as GPUs, and process queries to generate responses. Inference speed, or performance, is critical for real-time applications, since each inference involves millions or billions of floating-point operations. Recently, advanced inference engines such as vLLM have emerged, incorporating novel mechanisms such as efficient memory management to achieve state-of-the-art performance. In this paper, we analyze the performance, particularly the throughput (tokens generated per unit of time), of 20 LLMs using two inference libraries: vLLM and HuggingFace's pipelines. We investigate how various hyperparameters, which developers must configure, influence inference performance. Our results reveal that throughput landscapes are irregular, with distinct peaks, highlighting the importance of hyperparameter optimization to achieve maximum performance. We also show that applying hyperparameter optimization when upgrading or downgrading the GPU model used for inference can improve throughput from HuggingFace pipelines by an average of 9.16% and 13.7%, respectively.
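
As a reading aid, the throughput metric used in the abstract (tokens generated per unit of time) and a percentage improvement between two configurations can be sketched as follows. This is a minimal illustration, not the paper's measurement harness: the function names are hypothetical, and the percentage formula is one common way to express relative gain, assumed here rather than taken from the paper.

```python
def throughput(tokens_generated: int, elapsed_seconds: float) -> float:
    """Tokens generated per second for one timed inference run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return tokens_generated / elapsed_seconds


def relative_gain_percent(baseline: float, tuned: float) -> float:
    """Percentage change in throughput of `tuned` relative to `baseline`.

    Assumed formula: 100 * (tuned - baseline) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return 100.0 * (tuned - baseline) / baseline
```

For example, a run that emits 1000 tokens in 4 seconds has a throughput of 250 tokens per second, and moving from 200 to 220 tokens per second is a 10% gain under this formula.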

Paper Structure

This paper contains 39 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Throughput landscape across the hyperparameter space (batch size and GPUs).
  • Figure 2: Throughput Variation with Different Numbers of GPUs (Nvidia A100) during Online Inference (batch size $=$ 1).
  • Figure 3: Throughput at different batch sizes for HF$_{pl}$ using Nvidia A100 GPUs (outlier Starcoder2-3b removed and shown in a separate figure)
  • Figure 4: Comparison of throughput using two Nvidia A100 GPUs between HF$_{pl}$ (left) and vLLM (right)
  • Figure 5: Distribution of throughput improvement for HF$_{pl}$ given by hyperparameter optimization in hardware upgrading (from Nvidia V100 to A100) and downgrading (from A100 to V100).
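
The idea behind Figure 1, a throughput landscape over batch size and number of GPUs whose peak has to be searched for, can be illustrated with a small exhaustive search. This is only a sketch under stated assumptions: it is not InfPop, whose search strategy is not described on this page, and `toy_measure` returns made-up numbers rather than measurements from the paper. In practice, `measure` would run the inference engine with the given configuration and time it.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

Config = Tuple[int, int]  # (batch_size, num_gpus)


def grid_search(
    measure: Callable[[int, int], float],
    batch_sizes: Iterable[int],
    gpu_counts: Iterable[int],
) -> Tuple[Config, float, Dict[Config, float]]:
    """Evaluate every (batch_size, num_gpus) pair and return the best one.

    Returns the best configuration, its throughput, and the full landscape.
    """
    landscape = {
        (bs, gpus): measure(bs, gpus)
        for bs, gpus in product(batch_sizes, gpu_counts)
    }
    best = max(landscape, key=landscape.get)
    return best, landscape[best], landscape


def toy_measure(batch_size: int, num_gpus: int) -> float:
    """Hypothetical stand-in for a real benchmark run.

    The values are invented for illustration only; they are chosen so that
    throughput does not grow monotonically with batch size or GPU count.
    """
    table = {
        (1, 1): 30.0, (1, 2): 25.0,
        (8, 1): 120.0, (8, 2): 180.0,
        (16, 1): 90.0, (16, 2): 150.0,
    }
    return table[(batch_size, num_gpus)]


best_config, best_tput, landscape = grid_search(toy_measure, [1, 8, 16], [1, 2])
```

Because the landscape is not monotonic, the best configuration found on one GPU model need not be the best on another, which is why the search would be repeated after a hardware change.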