Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI
Saicharan Kolluru
TL;DR
This study addresses the practical problem of deploying large language models in production by comparing two prominent open-source serving frameworks, vLLM and HuggingFace TGI, across three LLaMA-2 model sizes (7B, 13B, 70B) on NVIDIA A100 GPUs. Using realistic workloads drawn from ShareGPT, it measures throughput, latency (p50/p95/p99), Time to First Token (TTFT), per-token generation time, GPU utilization, and memory usage across varied concurrency levels and generation patterns. vLLM achieves up to 24x higher throughput under high concurrency, which the study attributes to PagedAttention and continuous batching, while TGI delivers lower TTFT at low concurrency and integrates more closely with the HuggingFace ecosystem. vLLM also uses 19–27% less memory, which enables larger batches, and reaches higher GPU utilization (85–92% vs. 68–74%). The practical guidance: use vLLM for high-throughput, memory-constrained, or multi-tenant deployments; choose TGI for latency-sensitive interactive workloads and ease of deployment; and consider hybrid strategies as a possible path forward. The findings offer benchmarks and architectural insights to inform framework selection in production and point to future benchmarking across more models, quantization regimes, and hardware platforms.
Abstract
The deployment of Large Language Models (LLMs) in production environments requires efficient inference serving systems that balance throughput, latency, and resource utilization. This paper presents a comprehensive empirical evaluation of two prominent open-source LLM serving frameworks: vLLM and HuggingFace Text Generation Inference (TGI). We benchmark these systems across multiple dimensions including throughput performance, end-to-end latency, GPU memory utilization, and scalability characteristics using LLaMA-2 models ranging from 7B to 70B parameters. Our experiments reveal that vLLM achieves up to 24x higher throughput than TGI under high-concurrency workloads through its novel PagedAttention mechanism, while TGI demonstrates lower tail latencies for interactive single-user scenarios. We provide detailed performance profiles for different deployment scenarios and offer practical recommendations for system selection based on workload characteristics. Our findings indicate that the choice between these frameworks should be guided by specific use-case requirements: vLLM excels in high-throughput batch processing scenarios, while TGI is better suited for latency-sensitive interactive applications with moderate concurrency.
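To make the reported metrics concrete, the following is a minimal sketch of how throughput, latency percentiles, TTFT, and per-token generation time could be derived from per-request timestamps. It is not the benchmark harness used in this study: the `RequestTrace` fields, the nearest-rank percentile method, and the definition of throughput as total output tokens divided by wall-clock time are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one request (hypothetical field names)."""
    sent_at: float          # client sends the request
    first_token_at: float   # first generated token arrives
    finished_at: float      # last generated token arrives
    output_tokens: int      # number of generated tokens


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces: List[RequestTrace]) -> Dict[str, float]:
    """Aggregate serving metrics over a non-empty list of completed requests."""
    latencies = [t.finished_at - t.sent_at for t in traces]
    ttfts = [t.first_token_at - t.sent_at for t in traces]
    # Per-token generation time: decode duration over tokens after the first.
    per_token = [
        (t.finished_at - t.first_token_at) / (t.output_tokens - 1)
        for t in traces
        if t.output_tokens > 1
    ]
    # Assumed definition: output tokens per second of wall-clock time.
    wall = max(t.finished_at for t in traces) - min(t.sent_at for t in traces)
    total_tokens = sum(t.output_tokens for t in traces)
    return {
        "throughput_tok_per_s": total_tokens / wall,
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
        "latency_p99_s": percentile(latencies, 99),
        "ttft_p50_s": percentile(ttfts, 50),
        "mean_per_token_s": sum(per_token) / len(per_token) if per_token else 0.0,
    }


if __name__ == "__main__":
    # Synthetic traces for illustration only; not measured data.
    demo = [
        RequestTrace(0.0, 0.5, 2.5, 5),
        RequestTrace(0.0, 1.0, 4.0, 7),
        RequestTrace(1.0, 1.25, 3.25, 9),
        RequestTrace(2.0, 2.5, 5.0, 11),
    ]
    for name, value in summarize(demo).items():
        print(f"{name}: {value:.3f}")
```

Separating TTFT from per-token generation time matters for the comparison above, because a system can lead on one and trail on the other: TTFT mostly reflects queueing and prefill, while per-token time reflects decode-phase efficiency under batching.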
