OmniInfer: System-Wide Acceleration Techniques for Optimizing LLM Serving Throughput and Latency

Jun Wang, Yunxiang Yao, Wenwei Kuang, Runze Mao, Zhenhao Sun, Zhuang Tao, Ziyang Zhang, Dengyu Li, Jiajun Chen, Zhili Wang, Kai Cui, Congzhi Cai, Longwen Lan, Ken Zhang

TL;DR

OmniInfer tackles inefficiencies in large-scale LLM serving by decoupling prefill and decode through disaggregation and introducing three coordinated modules: OmniPlacement for MoE load balancing, OmniAttn for layer-wise sparse attention compression, and OmniProxy for global scheduling. The system, implemented atop vLLM and evaluated on Ascend NPUs, achieves up to 616 QPM, with TPOT reduced by up to 36% and TTFT by up to 38% when using the integrated framework. By combining adaptive disaggregation, inference-only sparsity search, and real-time scheduling decisions, OmniInfer demonstrates substantial end-to-end improvements in throughput and latency. The open-source design provides a practical path to deploying efficient, scalable LLM serving across heterogeneous hardware landscapes.

Abstract

Large Language Models drive a wide range of modern AI applications but impose substantial challenges on large-scale serving systems due to intensive computation, strict latency constraints, and throughput bottlenecks. We introduce OmniInfer, a unified system-level acceleration framework designed to maximize end-to-end serving efficiency through fine-grained optimization of expert placement, cache compression, and scheduling. OmniInfer integrates three complementary components: OmniPlacement for load-aware Mixture-of-Experts scheduling, OmniAttn for sparse attention acceleration, and OmniProxy for disaggregation-aware request scheduling. Built atop vLLM, OmniInfer delivers system-wide performance gains through adaptive resource disaggregation, efficient sparsity exploitation, and global coordination across prefill and decode phases. Evaluated on DeepSeek-R1 on a 10-node Ascend 910C cluster, OmniInfer achieves 616 QPM; the unified framework reduces TPOT by 36%, and adding OmniProxy on top further reduces TTFT by 38%. The project is open-sourced at [https://gitee.com/omniai/omniinfer](https://gitee.com/omniai/omniinfer).
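The abstract reports results in terms of QPM, TPOT, and TTFT without expanding the acronyms. As a reading aid, the following minimal sketch computes them under the commonly used definitions: time to first token (TTFT), time per output token (TPOT), and, as an assumption here, queries per minute for QPM. The `RequestTrace` type, the function names, and the timestamps are hypothetical and are not taken from the OmniInfer code base; the paper's exact measurement methodology may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request timing record (all times in seconds)."""
    arrival_s: float            # when the request reached the server
    token_times_s: List[float]  # emission time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between arrival and the first output token."""
    return trace.token_times_s[0] - trace.arrival_s


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between consecutive tokens after the first."""
    n = len(trace.token_times_s)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (trace.token_times_s[-1] - trace.token_times_s[0]) / (n - 1)


def qpm(completed_requests: int, window_s: float) -> float:
    """Queries per minute (assumed meaning of QPM) over a measurement window."""
    return 60.0 * completed_requests / window_s


def relative_reduction(baseline: float, optimized: float) -> float:
    """Fractional reduction of a latency metric relative to a baseline."""
    return (baseline - optimized) / baseline


# Illustrative numbers only, not measurements from the paper.
trace = RequestTrace(arrival_s=0.0, token_times_s=[0.5, 0.6, 0.7, 0.8])
print(ttft(trace), tpot(trace), qpm(2, 30.0), relative_reduction(100.0, 64.0))
```

Read this way, a "36% TPOT reduction" means `relative_reduction(baseline_tpot, optimized_tpot)` is 0.36 for the compared configurations.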

Paper Structure

This paper contains 32 sections, 9 equations, 1 figure, 3 tables, 2 algorithms.

Figures (1)

  • Figure 1: Structure of OmniInfer system under PD-disaggregated serving.