LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale

Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park

TL;DR

The paper addresses the lack of scalable, system-level simulators for LLM inference serving on heterogeneous hardware. It introduces LLMServingSim, a HW/SW co-simulation infrastructure built atop ASTRA-sim that operates at iteration granularity, exploits transformer-block reuse, and supports heterogeneous accelerators with memory-aware KV cache modeling. The simulator closely tracks real GPU-based serving (error under 14.7%) while delivering substantial speedups (up to ~491× faster than certain accelerator simulators, and ~35× on average versus others), enabling rapid exploration of hardware/software co-designs and scheduling strategies. For researchers and practitioners, it provides a flexible, pluggable, and scalable tool for evaluating LLM serving systems across diverse hardware configurations; the project is open-source for community use and extension.

Abstract

Recently, there has been an extensive research effort in building efficient large language model (LLM) inference serving systems. These efforts include not only innovations in the algorithm and software domains but also the development of various hardware acceleration techniques. Nevertheless, there is a lack of simulation infrastructure capable of accurately modeling versatile hardware-software behaviors in LLM serving systems without excessively lengthening simulation time. This paper aims to develop an effective simulation tool, called LLMServingSim, to support future research in LLM serving systems. In designing LLMServingSim, we focus on two limitations of existing simulators: (1) they do not account for the dynamic workload variations of LLM inference serving that arise from its autoregressive nature, and (2) they incur repetitive simulations without leveraging algorithmic redundancies in LLMs. To address these limitations, LLMServingSim simulates LLM serving at the granularity of iterations, leveraging the computation redundancies across decoder blocks and reusing the simulation results from previous iterations. Additionally, LLMServingSim provides a flexible framework that allows users to plug in any accelerator compiler-and-simulation stack for exploring various system designs with heterogeneous processors. Our experiments demonstrate that LLMServingSim produces simulation results closely following the performance behaviors of a real GPU-based LLM serving system with an error rate below 14.7%, while offering 91.5x faster simulation speed compared to existing accelerator simulators.
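
The reuse idea in the abstract can be illustrated with a minimal Python sketch. It is not LLMServingSim's actual interface: the class ReuseSimulator, the function toy_backend, the operator names, and the per-token costs are all hypothetical, and the backend is a toy stand-in for a real accelerator simulator. The sketch assumes every decoder block has the same operators and shapes, so one block is simulated and scaled by the block count, and results are cached so a later iteration with the same token count triggers no new backend call.

```python
# Hypothetical sketch: names and costs are illustrative, not LLMServingSim's API.
BLOCK_OPS = ("qkv_proj", "attention", "out_proj", "ffn")


def toy_backend(op, tokens):
    """Stand-in for an expensive accelerator simulation (made-up cost per token)."""
    cost_per_token = {"qkv_proj": 3, "attention": 2, "out_proj": 1, "ffn": 8}
    return cost_per_token[op] * tokens


class ReuseSimulator:
    def __init__(self, num_blocks, backend):
        self.num_blocks = num_blocks
        self.backend = backend
        self.cache = {}          # (op, tokens) -> latency from an earlier simulation
        self.backend_calls = 0   # how often the expensive backend actually ran

    def op_latency(self, op, tokens):
        key = (op, tokens)
        if key not in self.cache:
            # First time this operator/shape is seen: run the backend once.
            self.cache[key] = self.backend(op, tokens)
            self.backend_calls += 1
        return self.cache[key]

    def iteration_latency(self, tokens):
        # Simulate one decoder block, then scale by the number of identical blocks.
        block = sum(self.op_latency(op, tokens) for op in BLOCK_OPS)
        return block * self.num_blocks


sim = ReuseSimulator(num_blocks=32, backend=toy_backend)
latencies = [sim.iteration_latency(t) for t in (4, 4, 5)]
print(latencies, sim.backend_calls)  # [1792, 1792, 2240] 8
```

In this toy run the backend is invoked 8 times (4 operators for each of 2 distinct token counts) rather than 384 times (4 operators × 32 blocks × 3 iterations). A realistic cache key would likely need more than a token count, for example per-request sequence lengths for attention; that detail is left out here.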

Paper Structure (42 sections, 11 figures, 1 table, 1 algorithm)

Figures (11)

  • Figure 1: Architecture of a large language model.
  • Figure 2: (a) Simulation time comparison between mNPUsim, GeneSys, and NeuPIMs. (b) Roofline analysis on the arithmetic intensity of LLM inference operations.
  • Figure 3: Example system topology of LLMServingSim configured with hybrid parallelism, consisting of 4 pipeline parallel groups and 4 tensor parallel NPU nodes.
  • Figure 4: Workflow of LLMServingSim.
  • Figure 5: Two example system topologies of LLMServingSim with NPU and PIM hardware.
  • ...and 6 more figures