LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure

Jaehong Cho, Hyunmin Choi, Jongse Park

TL;DR

LLMServingSim2.0 tackles the fragmentation between hardware and software evaluation in large-language-model serving by offering a unified, trace-driven simulator that integrates heterogeneous accelerators and a broad set of serving techniques. It introduces four key advancements within a single framework: trace-driven performance modeling with an operator-level profiler, multi-instance and P/D disaggregation, MoE expert routing and offloading, and memory-aware prefix caching. The approach cuts hardware-integration effort by roughly 18.5x ($\approx$ 258 LoC versus 4764) and speeds up offline profiling, while reproducing the latency and throughput trends of real hardware (errors typically below $5\%$) across configurations, which makes design-space exploration practical. By supporting flexible routing, cache management, and scheduling, LLMServingSim2.0 lets both hardware developers and LLM service providers evaluate diverse hardware/software configurations at scale with realistic performance characteristics.

Abstract

This paper introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5x fewer LoC and outperforms the predecessor's hardware-simulator integration, demonstrating LLMServingSim2.0's low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.
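
As a rough illustration of what trace-driven performance modeling can look like, the sketch below estimates per-iteration latency by looking up operator latencies in a profiled table and interpolating between profiled points. It is not the authors' implementation: the table layout, the operator names, the latency values, and the functions `lookup_latency` and `iteration_latency` are all hypothetical, chosen only to show the idea of replacing a detailed hardware simulator with profiled traces.

```python
from bisect import bisect_left

# Hypothetical profiled trace: operator name -> (num_tokens, latency_ms) samples,
# sorted by num_tokens. The values are invented for illustration and are not
# measurements from the paper.
PROFILE = {
    "attention": [(1, 0.25), (128, 0.5), (512, 2.0)],
    "mlp": [(1, 0.5), (128, 1.0), (512, 4.0)],
}


def lookup_latency(op, num_tokens, profile=PROFILE):
    """Estimate one operator's latency by linear interpolation over profiled points."""
    samples = profile[op]
    sizes = [n for n, _ in samples]
    # Clamp to the profiled range instead of extrapolating beyond it.
    if num_tokens <= sizes[0]:
        return samples[0][1]
    if num_tokens >= sizes[-1]:
        return samples[-1][1]
    i = bisect_left(sizes, num_tokens)
    (n0, t0), (n1, t1) = samples[i - 1], samples[i]
    frac = (num_tokens - n0) / (n1 - n0)
    return t0 + frac * (t1 - t0)


def iteration_latency(num_tokens, num_layers, profile=PROFILE):
    """Sum the per-operator estimates for one layer and scale by the layer count."""
    per_layer = sum(lookup_latency(op, num_tokens, profile) for op in profile)
    return num_layers * per_layer
```

Under this assumed layout, supporting a new accelerator would amount to supplying a new profile table, with the lookup and aggregation code left unchanged. A call such as `iteration_latency(320, num_layers=2)` combines the interpolated attention and MLP estimates for a 320-token batch.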

Paper Structure

This paper contains 12 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of LLMServingSim2.0 simulator.
  • Figure 2: Latency and throughput comparison of vLLM and LLMServingSim2.0, across five system configurations.
  • Figure 3: Simulation time comparison of LLMServingSim2.0 against LLMServingSim and LLMServingSim+ across multiple system configurations.