APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving

Yi-Chien Lin, Woosuk Kwon, Ronald Pineda, Fanny Nina Paravecino

TL;DR

APEX is an extensible, dynamism-aware simulator that automatically identifies optimal parallel execution plans for LLM serving by evaluating combinations of data (DP), pipeline (PP), tensor (TP), and expert (EP) parallelism under iteration-level batching. It uses a Transformer IR and Parallel Templates to represent models and plan spaces compactly, a bottom-up Device Mapper for physical deployment, and a batching-plus-simulation loop calibrated with operation-level profiling to predict time and energy. In experiments across multiple models, traces, and clusters, APEX finds plans up to $3.37\times$ faster than heuristic baselines and supports energy-aware planning (energy reductions of up to $45\%$), while maintaining fidelity (average relative error of about $10.7\%$) and scaling to trillion-scale models with CPU-based planning. The framework is open source and designed for easy extension to new models, devices, batching schemes, and parallelisms, giving service providers a cost-effective, scalable tool for meeting SLOs and optimizing operating metrics.

Abstract

Efficiently serving Large Language Models (LLMs) requires selecting an optimal parallel execution plan, balancing computation, memory, and communication overhead. However, determining the best strategy is challenging due to varying parallelism techniques (data, pipeline, tensor) and workload characteristics (e.g., compute-intensive tasks with long prompts vs. memory-intensive tasks with long generation). We propose APEX, an LLM serving system simulator that efficiently identifies optimal parallel execution plans by considering key factors of LLM serving systems, such as memory usage, batching behavior, etc. APEX performs dynamism-aware simulation to model iteration-level batching, and leverages LLMs' repetitive structure to reduce design space, scaling efficiently to trillion-scale models. APEX abstracts the key components of LLM serving systems, including the model, batching module, quantization formats, and device clusters, enabling the simulator to be general and extensible. Simulating on a CPU, APEX evaluates execution plans for various device clusters, covering diverse LLMs and workloads. APEX finds plans up to 3.37x faster than heuristics, and also plans that reduce energy consumption by up to 45% compared to latency-optimal plans. APEX performs comprehensive evaluations, reporting key system metrics like time per output token and time to first token, which can help service providers meet SLOs. APEX identifies an optimal plan within 15 minutes on a CPU, making it 71x faster and 1234x more cost-effective than cloud-based GPU deployment. APEX can be accessed at https://github.com/microsoft/apex_plus
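
To make the size of the plan space concrete, the following is a minimal sketch, not APEX's actual plan generator. It assumes a plan is just a triple of data, pipeline, and tensor parallel degrees whose product equals the number of devices. The function name `enumerate_plans` and this simplified plan representation are hypothetical and do not come from the paper or the APEX repository.

```python
from itertools import product


def enumerate_plans(num_devices: int) -> list[tuple[int, int, int]]:
    """Return every (dp, pp, tp) triple whose product equals num_devices.

    Hypothetical illustration only: a real planner would also account for
    memory limits, expert parallelism, and the cluster topology.
    """
    # Each parallel degree must divide the total device count.
    divisors = [d for d in range(1, num_devices + 1) if num_devices % d == 0]
    plans = []
    for dp, pp in product(divisors, repeat=2):
        # Keep only combinations where the remaining devices form a whole TP group.
        if num_devices % (dp * pp) == 0:
            plans.append((dp, pp, num_devices // (dp * pp)))
    return plans


if __name__ == "__main__":
    for dp, pp, tp in enumerate_plans(8):
        print(f"DP={dp} PP={pp} TP={tp}")
```

Even under this simplification, 8 devices already yield 10 candidate plans, and the count grows quickly with more devices and more parallelism types, which is why evaluating the candidates by simulation rather than by deployment matters.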

Paper Structure

This paper contains 29 sections, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: An example two-level device cluster. Memory bandwidth and latency are uniform within the same level.
  • Figure 2: System overview of APEX
  • Figure 3: The Parallel Execution Plan Generator produces plans that map a given LLM onto the target cluster using distinct parallelization strategies
  • Figure 4: Transformer IR represents LLMs in a canonical way
  • Figure 5: Examples of APEX's Parallel Templates for two devices. The parameterized templates can extend to D devices.
  • ...and 4 more figures