Vidur: A Large-Scale Simulation Framework For LLM Inference

Amey Agrawal; Nitin Kedia; Jayashree Mohan; Ashish Panwar; Nipun Kwatra; Bhargav Gulavani; Ramachandran Ramjee; Alexey Tumanov

TL;DR

Vidur is a high-fidelity simulation framework for LLM inference that addresses the prohibitive cost of exploring deployment configurations across model parallelism, scheduling, batching, and workload knobs. It combines a profiling-driven runtime estimator with a hierarchical scheduler, a benchmark suite (Vidur-Bench), and a configuration-search tool (Vidur-Search), enabling fast, inexpensive end-to-end performance evaluation and workload-aware deployment optimization. The authors report less than 9% error in estimated inference latency across models and workloads, and show that Vidur-Search finds the best deployment configuration for LLaMA2-70B in one hour on a CPU machine, whereas deployment-based exploration would require about 42K GPU hours (roughly 218K dollars). This work provides a practical, scalable path to cost-effective LLM inference deployments in production environments.

Abstract

Optimizing the deployment of large language models (LLMs) is expensive today since it requires experimentally running an application workload against an LLM implementation while exploring a large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur, a large-scale, high-fidelity, easily extensible simulation framework for LLM inference performance. Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end inference performance for different workloads by estimating several metrics of interest such as latency and throughput. We validate the fidelity of Vidur on several LLMs and show that it estimates inference latency with less than 9% error across the range. Further, we present Vidur-Search, a configuration search tool that helps optimize LLM deployment. Vidur-Search uses Vidur to automatically identify the most cost-effective deployment configuration that meets application performance constraints. For example, Vidur-Search finds the best deployment configuration for LLaMA2-70B in one hour on a CPU machine, in contrast to a deployment-based exploration, which would require 42K GPU hours, costing about 218K dollars. Source code for Vidur is available at https://github.com/microsoft/vidur.
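
The selection step the abstract attributes to Vidur-Search (pick the most cost-effective configuration that meets the application's performance constraints) can be illustrated with a minimal sketch. This is not Vidur's code or API: the `SimResult` record, the `pick_config` helper, the configuration names, and all numbers are hypothetical, and "cost-effective" is assumed here to mean the highest capacity per dollar among configurations whose simulated latencies meet both limits.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SimResult:
    """Simulated metrics for one candidate deployment (hypothetical values)."""
    name: str
    capacity_qps: float    # highest request rate the deployment sustains
    cost_per_hour: float   # dollars per hour for the hardware
    ttft_p90_s: float      # 90th-percentile time to first token, seconds
    tbt_p99_s: float       # 99th-percentile time between tokens, seconds

    @property
    def capacity_per_dollar(self) -> float:
        return self.capacity_qps / self.cost_per_hour


def pick_config(results: Iterable[SimResult],
                ttft_slo_s: float,
                tbt_slo_s: float) -> Optional[SimResult]:
    """Return the feasible result with the best capacity per dollar, or None."""
    feasible = [r for r in results
                if r.ttft_p90_s <= ttft_slo_s and r.tbt_p99_s <= tbt_slo_s]
    return max(feasible, key=lambda r: r.capacity_per_dollar, default=None)


# Hypothetical simulator output for three candidate configurations.
candidates = [
    SimResult("tp2-pp1", capacity_qps=4.0, cost_per_hour=8.0,
              ttft_p90_s=1.5, tbt_p99_s=0.12),
    SimResult("tp4-pp1", capacity_qps=6.0, cost_per_hour=16.0,
              ttft_p90_s=0.9, tbt_p99_s=0.08),
    SimResult("tp1-pp2", capacity_qps=5.0, cost_per_hour=8.0,
              ttft_p90_s=2.8, tbt_p99_s=0.25),
]

best = pick_config(candidates, ttft_slo_s=2.0, tbt_slo_s=0.2)
print(best.name if best else "no feasible configuration")
```

In this toy data the configuration with the highest raw capacity per dollar violates both latency limits, so it is filtered out before ranking. Because each candidate is scored from simulated rather than measured metrics, the search can run on a CPU machine.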

Paper Structure (25 sections, 8 figures, 2 tables)

This paper contains 25 sections, 8 figures, 2 tables.

Introduction
Background and Motivation
Overview of LLMs
LLM Inference Efficiency Optimizations
LLM Inference Configuration Space
Challenges in Simulating LLM Inference
Vidur
Key Insights
System Overview
Profiler
Runtime Estimator
Hierarchical Scheduler
Vidur-Bench
Datasets and workloads
Performance metrics
...and 10 more sections

Figures (8)

Figure 1: Both the model and workload matter for the optimal deployment configuration. Optimal configurations for each model-trace pair are shown in (a). Throughput/cost can differ significantly for the same model if the workload is changed, as shown in (b).
Figure 2: Vidur Simulator High Level Architecture.
Figure 3: Fidelity of Vidur's request execution time prediction for four models and three static traces.
Figure 4: Fidelity of Vidur's execution time predictions across four models and three dynamic workload traces, using request load at 85% of the maximum serving capacity for each scenario.
Figure 5: Capacity per dollar for different deployment configurations vs. the corresponding TTFT-P90 (left) and TBT-P99 (middle). Also shown is the Pareto curve for these configurations. The shaded area corresponds to the region where the corresponding SLO is satisfied. (right) Both latency metrics for these configurations, with capacity per dollar visualized via a temperature colormap. In the left and middle plots, green points correspond to configurations that satisfy the SLOs for both metrics. Note that the blue points on a Pareto curve show that even a Pareto-optimal point for one metric may not satisfy the SLO for the other metric.
...and 3 more figures
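
The point made in the Figure 5 caption, that a configuration on the Pareto curve for one latency metric can still violate the SLO for the other, can be reproduced with a small sketch. The configuration names, metric values, and SLO thresholds below are made up for illustration, and `pareto_front` is a hypothetical helper, not part of Vidur.

```python
def pareto_front(points):
    """Names of points not dominated in (lower latency, higher capacity per dollar).

    `points` is a list of (name, latency, capacity_per_dollar) tuples.
    """
    front = []
    best_cpd = float("-inf")
    # Scan from the lowest latency upward; a point is on the frontier only if
    # it offers more capacity per dollar than every lower-latency point.
    for name, _latency, cpd in sorted(points, key=lambda p: (p[1], -p[2])):
        if cpd > best_cpd:
            front.append(name)
            best_cpd = cpd
    return front


# Hypothetical configurations: (TTFT-P90 s, TBT-P99 s, capacity per dollar).
configs = {
    "A": (0.8, 0.30, 0.40),
    "B": (1.2, 0.10, 0.55),
    "C": (1.5, 0.15, 0.50),
    "D": (2.5, 0.08, 0.70),
}
TTFT_SLO_S, TBT_SLO_S = 2.0, 0.2  # assumed thresholds

ttft_front = pareto_front([(n, m[0], m[2]) for n, m in configs.items()])

# Frontier points that meet the TTFT SLO but miss the TBT SLO: the analogue
# of the blue points described in the caption.
blue = [n for n in ttft_front
        if configs[n][0] <= TTFT_SLO_S and configs[n][1] > TBT_SLO_S]

print(ttft_front, blue)
```

Here configuration A sits on the TTFT frontier and satisfies the TTFT limit, yet its TBT exceeds the assumed threshold, so optimizing against a single latency metric would select a deployment that fails the other constraint.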
