Compass: Mapping Space Exploration for Multi-Chiplet Accelerators Targeting LLM Inference Serving Workloads

Boyu Li; Zongwei Zhu; Yi Xiong; Qianyue Cao; Jiawei Geng; Xiaonan Zhang; Xi Li

TL;DR

Compass addresses mapping space exploration for multi-chiplet accelerators under dynamic LLM inference serving workloads, which mix request types and have variable sequence lengths. It introduces a computation execution graph-based mapping encoding and a genetic algorithm (GA)-driven mapping generation engine, paired with an evaluation engine that models intra- and inter-chiplet latency and energy via a data-access-flag memory model. The resulting static mappings are tuned to sequence-length distributions and heterogeneous hardware, and they reduce energy-delay product (EDP) by 63.12% on average compared to state-of-the-art works. The work also reports improved adaptability across diverse models and serving strategies and provides an open-source implementation.

Abstract

Large Language Models (LLMs) impose massive computational demands, driving the need for scalable multi-chiplet accelerators. However, existing mapping space exploration efforts for such accelerators primarily focus on traditional CNN/Transformer workloads and fail to adequately support the dynamic behaviors of mixed request types and variable sequence lengths in real-world LLM inference serving. To bridge this gap, we first propose a computation execution graph-based mapping encoding scheme that decouples micro-batches and layers, enabling fine-grained execution control on heterogeneous chiplets and flexibly representing various parallelism strategies. Second, building upon this scheme, we develop the Compass framework, which integrates an evaluation engine and a genetic algorithm-based mapping generation engine to achieve efficient mapping search. Compared to state-of-the-art works, our solution achieves an average energy-delay product (EDP) reduction of 63.12%.
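To make the two ideas in the abstract concrete, here is a minimal toy sketch of a genetic algorithm searching over a mapping that assigns each (micro-batch, layer) node to a chiplet, with EDP (latency times energy) as the fitness. It is not the Compass implementation: the abstract does not specify the encoding details, the genetic operators, or the evaluation model, so the chiplet costs, the hop penalties, the scheduling rule, and every function name below are assumptions chosen only for illustration.

```python
import random

# Hypothetical toy problem; none of these numbers come from the paper.
NUM_MICRO_BATCHES = 3
NUM_LAYERS = 4
# Assumed (latency, energy) per executed node on each heterogeneous chiplet.
CHIPLETS = [(1.0, 4.0), (2.0, 1.5), (1.5, 2.5)]
HOP_LATENCY = 0.5  # assumed cost of moving activations between chiplets
HOP_ENERGY = 1.0


def random_mapping(rng):
    """A mapping is a matrix: mapping[m][l] = chiplet running layer l of micro-batch m."""
    return [[rng.randrange(len(CHIPLETS)) for _ in range(NUM_LAYERS)]
            for _ in range(NUM_MICRO_BATCHES)]


def evaluate(mapping):
    """Return (latency, energy) from a simple in-order schedule of the node grid."""
    chiplet_free = [0.0] * len(CHIPLETS)   # time at which each chiplet becomes idle
    ready = [0.0] * NUM_MICRO_BATCHES      # time at which each micro-batch's last node finished
    energy = 0.0
    for l in range(NUM_LAYERS):            # node (m, l) depends on node (m, l - 1)
        for m in range(NUM_MICRO_BATCHES):
            c = mapping[m][l]
            lat, en = CHIPLETS[c]
            available = ready[m]
            if l > 0 and mapping[m][l - 1] != c:
                available += HOP_LATENCY   # inter-chiplet transfer penalty
                energy += HOP_ENERGY
            start = max(available, chiplet_free[c])
            ready[m] = chiplet_free[c] = start + lat
            energy += en
    return max(ready), energy


def edp(mapping):
    """Energy-delay product: lower is better."""
    latency, energy = evaluate(mapping)
    return latency * energy


def crossover(a, b, rng):
    """Uniform crossover: each node inherits its chiplet from one parent."""
    return [[a[m][l] if rng.random() < 0.5 else b[m][l] for l in range(NUM_LAYERS)]
            for m in range(NUM_MICRO_BATCHES)]


def mutate(mapping, rng, rate=0.1):
    """Reassign each node to a random chiplet with probability `rate`."""
    return [[rng.randrange(len(CHIPLETS)) if rng.random() < rate else gene for gene in row]
            for row in mapping]


def search(generations=40, pop_size=30, elite=4, seed=0):
    """Elitist GA; returns the best mapping and the best EDP seen at each generation."""
    rng = random.Random(seed)
    population = [random_mapping(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=edp)
        history.append(edp(population[0]))
        parents = population[:pop_size // 2]
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - elite)]
        population = population[:elite] + children  # elites survive unchanged
    return min(population, key=edp), history


if __name__ == "__main__":
    best, history = search()
    print("best mapping:", best)
    print("best EDP:", edp(best))
```

Because the micro-batch and layer indices are separate axes of the mapping, the same representation can express different parallelism styles in this toy setting: giving every layer of a micro-batch the same chiplet resembles data parallelism, while giving every micro-batch the same chiplet per layer resembles pipeline parallelism.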
