Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels

Mingcong Song; Xinru Tang; Fengfan Hou; Jing Li; Wei Wei; Yipeng Ma; Runqiu Xiao; Hongjie Si; Dingcheng Jiang; Shouyi Yin; Yang Hu; Guoping Long

TL;DR

This work tackles dynamic workloads in production LLM inference by introducing XY-Serve, a system that maps variable prefill/decode/verify (P/D/V) workloads onto hardware-friendly meta-primitives through token-wise scheduling and meta-kernel design. The approach couples dynamic task decomposition with reordering and two meta-kernels, Meta-Attention and SmoothGEMM, which handle diverse attention masks and dynamic GEMM shapes without padding overhead. Empirical results on Ascend NPUs show up to 89% end-to-end QPS improvement and kernel-level gains (attention 21.5% faster on average, GEMM 14.6% faster on average), with end-to-end MFU/MBU rivaling GPU baselines. Overall, XY-Serve demonstrates how hardware-aware abstractions and offline profiling can sustain high efficiency under realistic, dynamic workloads in production-grade LLM serving.

Abstract

Meeting growing demands for low latency and cost efficiency in production-grade large language model (LLM) serving systems requires integrating advanced optimization techniques. However, the dynamic and unpredictable input and output lengths of LLM requests, compounded by these optimizations, exacerbate workload variability, making it difficult to maintain high efficiency on AI accelerators, especially DSAs with tile-based programming models. To address this challenge, we introduce XY-Serve, a versatile, Ascend-native, end-to-end production LLM-serving system. The core idea is an abstraction mechanism that smooths out workload variability by decomposing computations into unified, hardware-friendly, fine-grained meta-primitives. For attention, we propose a meta-kernel that computes the basic matmul-softmax-matmul pattern with architecture-aware tile sizes. For GEMM, we introduce a virtual padding scheme that adapts to dynamic shape changes while using highly efficient GEMM primitives with assorted fixed tile sizes. XY-Serve integrates cleanly with vLLM. Experimental results show up to 89% end-to-end throughput improvement over current publicly available baselines on Ascend NPUs. Additionally, our GEMM and attention kernels outperform those in existing libraries, running 14.6% and 21.5% faster on average, respectively. While the work is Ascend-native, we believe the approach is readily applicable to SIMT architectures as well.
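To make the virtual padding idea concrete, the following is a minimal Python sketch of one plausible reading of it, not the SmoothGEMM implementation. The abstract does not spell out the mechanism, so the sketch assumes that padding exists only inside a fixed-size scratch tile and that only valid rows are written back. The names `TILE_M`, `matmul_fixed_tile`, and `gemm_dynamic_m` are hypothetical, and the tile height of 4 is chosen purely for illustration.

```python
from typing import List

Matrix = List[List[float]]

TILE_M = 4  # hypothetical fixed tile height; a real kernel would use hardware-specific sizes


def matmul_fixed_tile(a_tile: Matrix, b: Matrix) -> Matrix:
    """Stand-in for a fixed-shape GEMM primitive: accepts exactly TILE_M rows."""
    assert len(a_tile) == TILE_M
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a_tile]


def gemm_dynamic_m(a: Matrix, b: Matrix) -> Matrix:
    """Multiply a (M x K) by b (K x N) for any M using only the fixed-tile primitive."""
    m, k = len(a), len(b)
    out: Matrix = []
    for start in range(0, m, TILE_M):
        valid = min(TILE_M, m - start)  # rows of real data in this tile
        # Assumption: zeros are filled into the scratch tile only; `a` is never padded.
        tile = [list(a[start + r]) if r < valid else [0.0] * k
                for r in range(TILE_M)]
        result = matmul_fixed_tile(tile, b)
        out.extend(result[:valid])  # write back valid rows, discard padded rows
    return out


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]  # M = 5 is not a multiple of TILE_M
    b = [[1, 0, 2], [0, 1, 3]]
    for row in gemm_dynamic_m(a, b):
        print(row)
```

In this sketch the primitive always sees the same tile shape regardless of M, which is the property that lets a small set of pre-tuned fixed-shape kernels cover dynamically changing inputs.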
