Compass: Mapping Space Exploration for Multi-Chiplet Accelerators Targeting LLM Inference Serving Workloads

Boyu Li, Zongwei Zhu, Yi Xiong, Qianyue Cao, Jiawei Geng, Xiaonan Zhang, Xi Li

TL;DR

Compass tackles the challenge of mapping space exploration for multi-chiplet accelerators under dynamic LLM inference workloads characterized by mixed request types and variable sequence lengths. It introduces a computation execution graph-based encoding and a GA-driven mapping generation engine within the Compass framework, paired with an evaluation engine that models intra- and inter-chiplet latency and energy via a data-access-flag memory model. The approach enables static mappings tuned to sequence-length distributions and heterogeneous hardware, achieving substantial energy-delay product reductions over state-of-the-art baselines. The work demonstrates improved adaptability across diverse models and serving strategies, and provides an open-source implementation to foster broader adoption. Overall, Compass advances co-optimization of hardware topology and dynamic LLM workloads for efficient serving at scale.

Abstract

Large Language Models (LLMs) impose massive computational demands, driving the need for scalable multi-chiplet accelerators. However, existing mapping space exploration efforts for such accelerators primarily focus on traditional CNN/Transformer workloads and fail to adequately support the dynamic behaviors of mixed request types and variable sequence lengths in real-world LLM inference serving. To bridge this gap, we first propose a computation execution graph-based mapping encoding scheme that decouples micro-batches and layers, enabling fine-grained execution control on heterogeneous chiplets and flexibly representing various parallelism strategies. Second, building upon this scheme, we develop the Compass framework, which integrates an evaluation engine and a genetic algorithm-based mapping generation engine to achieve efficient mapping search. Compared to state-of-the-art works, our solution achieves an average EDP reduction of 63.12%.
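To make the search loop described in the abstract concrete, the following is a minimal sketch of a genetic algorithm that assigns each (micro-batch, layer) pair to a chiplet and minimizes EDP (energy × latency). It is not the Compass implementation: the encoding (a plain micro-batch × layer matrix of chiplet indices), the toy cost model, and every constant and name (`CHIPLET_SPEED`, `HOP_LATENCY`, `evaluate`, `search`, and so on) are assumptions chosen for illustration only.

```python
import random

# Hypothetical toy problem (NOT the Compass cost model or encoding).
NUM_MICRO_BATCHES, NUM_LAYERS, NUM_CHIPLETS = 4, 4, 4
CHIPLET_SPEED = [1.0, 1.0, 2.0, 2.0]    # assumed relative throughput (heterogeneous)
CHIPLET_ENERGY = [1.0, 1.0, 1.5, 1.5]   # assumed energy per unit of work
MICRO_BATCH_WORK = [128, 512, 64, 256]  # stands in for variable sequence lengths
HOP_LATENCY, HOP_ENERGY = 10.0, 5.0     # assumed inter-chiplet transfer cost


def evaluate(mapping):
    """Return (latency, energy); mapping[b][l] is the chiplet for layer l of micro-batch b."""
    chiplet_free = [0.0] * NUM_CHIPLETS  # time at which each chiplet becomes idle
    latency = energy = 0.0
    for b in range(NUM_MICRO_BATCHES):
        ready, prev = 0.0, None
        for l in range(NUM_LAYERS):
            c = mapping[b][l]
            if prev is not None and prev != c:  # data moves between chiplets
                ready += HOP_LATENCY
                energy += HOP_ENERGY
            start = max(ready, chiplet_free[c])
            ready = start + MICRO_BATCH_WORK[b] / CHIPLET_SPEED[c]
            chiplet_free[c] = ready
            energy += MICRO_BATCH_WORK[b] * CHIPLET_ENERGY[c]
            prev = c
        latency = max(latency, ready)
    return latency, energy


def edp(mapping):
    latency, energy = evaluate(mapping)
    return latency * energy


def search(generations=50, pop_size=30, mutation_rate=0.1, seed=0):
    """Run a simple elitist GA; return (best initial mapping, best final mapping)."""
    rng = random.Random(seed)

    def random_mapping():
        return [[rng.randrange(NUM_CHIPLETS) for _ in range(NUM_LAYERS)]
                for _ in range(NUM_MICRO_BATCHES)]

    def crossover(a, b):  # uniform crossover, gene by gene
        return [[x if rng.random() < 0.5 else y for x, y in zip(ra, rb)]
                for ra, rb in zip(a, b)]

    def mutate(m):  # reassign each gene to a random chiplet with small probability
        return [[rng.randrange(NUM_CHIPLETS) if rng.random() < mutation_rate else g
                 for g in row] for row in m]

    def tournament(pop):  # pick the best of three random individuals
        return min(rng.sample(pop, 3), key=edp)

    pop = [random_mapping() for _ in range(pop_size)]
    best_initial = min(pop, key=edp)
    for _ in range(generations):
        children = [min(pop, key=edp)]  # elitism: carry the best mapping over unchanged
        while len(children) < pop_size:
            children.append(mutate(crossover(tournament(pop), tournament(pop))))
        pop = children
    return best_initial, min(pop, key=edp)


if __name__ == "__main__":
    start, best = search()
    print("EDP reduction vs. best random mapping:", 1 - edp(best) / edp(start))
```

Because the best individual is carried over unchanged each generation, the final EDP can never exceed that of the best initial mapping. The printed ratio uses the same form as a reported EDP reduction (one minus new EDP over baseline EDP), but the numbers from this toy model say nothing about the paper's results.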

Paper Structure

This paper contains 18 sections, 1 equation, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison between workload characteristics in real-world inference serving scenarios and those supported by existing MSE works. P:<num> denotes a prefill request with a sequence length of <num>, D denotes a decode request, and CP denotes a chunked prefill request.
  • Figure 2: LLM Processing in Inference Serving.
  • Figure 3: Multi-chiplet accelerator hardware template.
  • Figure 4: State and scheduling order of the computation execution graph after being partitioned into subgraphs.
  • Figure 5: An example of a mapping encoding. The workload in this example is a neural network with 4 GEMMs, where the batch size is 8 with variable sequence lengths. This workload is to be mapped onto an accelerator with 4 chiplets. The mapping encoding provides a feasible mapping scheme. The lower-left portion shows the computation execution graph corresponding to the mapping encoding and the spatio-temporal diagram of the actual execution process. The right portion presents the mapping encoding representations and spatio-temporal execution diagrams of three common parallelism paradigms.
  • ...and 4 more figures