OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale

Jingze Shi, Zhangyang Peng, Yizhang Zhu, Yifan Wu, Guang Liu, Yuyu Luo

TL;DR

OmniMoE addresses the conflict between fine-grained expert expressivity and hardware efficiency in MoEs by integrating a universally activated shared MLP with a massive pool of atomic experts, assembled dynamically for each token. It introduces a Cartesian Product Router to reduce routing complexity from $O(N)$ to about $O(\sqrt{N})$ and an Expert-Centric Scheduling scheme to transform scattered memory accesses into dense, reusable GEMMs. In extensive experiments across seven benchmarks and multiple scales, OmniMoE achieves a 50.9 average zero-shot score with 1.7B active parameters and delivers up to a 10.9× inference speedup over state-of-the-art fine-grained baselines, while maintaining competitive memory usage. This holistic co-design demonstrates that massive-scale fine-grained MoEs can be both highly accurate and practically fast, and the authors provide open-source code for broader adoption.

Abstract

Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from $O(N)$ to $O(\sqrt{N})$; and (ii) Expert-Centric Scheduling that inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. On seven benchmarks, OmniMoE (with 1.7B active parameters) achieves 50.9% zero-shot accuracy, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73 ms to 6.7 ms (a 10.9-fold speedup) compared to PEER, demonstrating that massive-scale fine-grained MoE can be fast and accurate. Our code is open-sourced at https://github.com/flash-algo/omni-moe.
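
To make the $O(N)$ to $O(\sqrt{N})$ routing claim concrete, here is a minimal sketch of top-$K$ selection over a Cartesian product of row and column scores. It is not the authors' implementation. It assumes that the $N$ experts sit on an $R \times C$ grid and that expert $(i, j)$ is scored as the sum of a row score and a column score; the abstract does not specify the combination rule. The function names `cartesian_topk` and `brute_force_topk` are hypothetical.

```python
import heapq
import random


def cartesian_topk(row_scores, col_scores, k):
    """Top-k experts on an R x C grid without scoring all R * C experts.

    Assumption (not from the paper): expert (i, j) has score
    row_scores[i] + col_scores[j]. Under that rule the global top-k
    always lies inside (top-k rows) x (top-k columns).
    """
    n_cols = len(col_scores)
    top_rows = heapq.nlargest(k, range(len(row_scores)), key=row_scores.__getitem__)
    top_cols = heapq.nlargest(k, range(n_cols), key=col_scores.__getitem__)
    # Only k * k candidates are scored, instead of R * C.
    candidates = [
        (row_scores[i] + col_scores[j], i * n_cols + j)
        for i in top_rows
        for j in top_cols
    ]
    return heapq.nlargest(k, candidates)


def brute_force_topk(row_scores, col_scores, k):
    """Reference: score every expert on the grid, O(R * C)."""
    n_cols = len(col_scores)
    all_scores = [
        (r + c, i * n_cols + j)
        for i, r in enumerate(row_scores)
        for j, c in enumerate(col_scores)
    ]
    return heapq.nlargest(k, all_scores)


rng = random.Random(0)
rows = [rng.gauss(0.0, 1.0) for _ in range(32)]  # 32 row scores
cols = [rng.gauss(0.0, 1.0) for _ in range(32)]  # 32 column scores -> 1024 experts
fast = cartesian_topk(rows, cols, k=4)
slow = brute_force_topk(rows, cols, k=4)
print([score for score, _ in fast] == [score for score, _ in slow])
```

With $R = C = \sqrt{N}$, the router computes $2\sqrt{N}$ scores plus $K^2$ candidate sums, rather than $N$ scores per token.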

Paper Structure (20 sections, 22 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Activation Patterns and System Optimization. (a) Coarse-grained MoE activates large experts, inevitably involving redundant parameters and wasting computation. (b) Fine-grained MoE improves parameter efficiency, but suffers from bandwidth bottlenecks due to scattered, fragmented memory accesses. (c) Our OmniMoE employs a universally activated shared dense MLP, and uses expert-centric scheduling to reorganize fine-grained expert fetches into contiguous, coalesced memory accesses, achieving both high parameter efficiency and hardware-efficient execution.
  • Figure 2: Overview of the OmniMoE Architecture. The framework operates via two parallel pathways to balance efficiency and expressivity. (a) Dynamic Expert Assembly (Top): For the long-tail knowledge retrieval objective, we employ a Cartesian Product Router (decomposed into Row/Column routers) to efficiently compute routing scores $\mathbf{g}_x$ and identify the top-$K$ expert indices $\mathcal{I}_x$. The system then dynamically retrieves specific parameter slices from the global matrices $W, V$ to assemble compact, token-dependent parameter blocks $w_x, v_x$ for the final gated projection. (b) Shared Expert (Bottom): A dense MLP that is always active and handles general semantics. The final output is obtained by aggregating the outputs from the sparse, routed branch and the shared dense branch.
  • Figure 3: Comparison of Execution Paradigms: Token-Centric vs. Expert-Centric Scheduling. (a) Conventional: Tokens independently fetch parameters from scattered experts, leading to random memory accesses (high load overhead) and fragmented vector-vector computations that underutilize on-chip SMs. (b) Our Approach: We invert the execution order using expert-centric scheduling. Left-to-Right: First, tasks are reordered: we compress active experts into dense groups (e.g., experts 0–3 are grouped into Group 1) and sort tasks by Token ID within each group. Matrix Fusion: This reorganization allows us to merge individual token-expert pairs into dense tensors. Instead of scattered ops, the GPU executes efficient Grouped GEMM kernels (rightmost block), where a block of expert weights is loaded once and reused across stacked tokens, maximizing Tensor Core utilization and memory bandwidth.
  • Figure 4: End-to-End Efficiency Comparison. (a, b) Inference latency and (c, d) peak memory versus activated parameters (left column) and input token count (right column). Baselines include Dense, GShard, DeepSeekMoE, PKM, and PEER. OmniMoE achieves consistently lower latency than DeepSeekMoE and fine-grained baselines (PKM/PEER), while maintaining a peak memory footprint comparable to coarse-grained MoEs.
  • Figure 5: Scaling Laws. Validation perplexity (lower is better) versus (a) training FLOPs and (b) activated parameters. OmniMoE consistently outperforms all baselines, achieving the best trade-off between model quality and computational cost.
  • ...and 2 more figures
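
The reordering described in the Figure 3 caption can be illustrated with a small pure-Python sketch. It is a toy model under stated assumptions, not the paper's kernel. Each atomic expert `e` is assumed to be a vector pair `(w[e], v[e])` contributing `gate * relu(w[e] · x) * v[e]` to a token's output; the ReLU and this exact gated form are assumptions. The `loads` counter is a hypothetical stand-in for expert weight fetches. The sketch only shows the task sort by (expert, token); the grouping of experts into blocks and the Grouped GEMM kernels are not modeled.

```python
from itertools import groupby


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def apply_expert(out_row, x_row, w_e, v_e, gate):
    """Accumulate gate * relu(w_e . x) * v_e into out_row (assumed expert form)."""
    h = gate * max(dot(w_e, x_row), 0.0)
    for i, v_i in enumerate(v_e):
        out_row[i] += h * v_i


def token_centric(x, routes, w, v):
    """Each token fetches its own experts: one weight load per (token, expert) pair."""
    out = [[0.0] * len(x[0]) for _ in x]
    loads = 0
    for t, picks in enumerate(routes):
        for e, gate in picks:
            loads += 1
            apply_expert(out[t], x[t], w[e], v[e], gate)
    return out, loads


def expert_centric(x, routes, w, v):
    """Sort tasks by (expert, token); each active expert is loaded once and reused."""
    out = [[0.0] * len(x[0]) for _ in x]
    tasks = sorted((e, t, gate) for t, picks in enumerate(routes) for e, gate in picks)
    loads = 0
    for e, group in groupby(tasks, key=lambda task: task[0]):
        w_e, v_e = w[e], v[e]  # single fetch for the whole batch of tokens
        loads += 1
        for _, t, gate in group:
            apply_expert(out[t], x[t], w_e, v_e, gate)
    return out, loads


# Toy data: 4 atomic experts, 3 tokens, hidden size 2, top-2 routing.
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
v = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
x = [[1.0, 2.0], [3.0, -1.0], [0.0, 1.0]]
routes = [[(0, 0.5), (2, 0.5)], [(2, 0.75), (3, 0.25)], [(0, 0.5), (1, 0.5)]]

scattered_out, scattered_loads = token_centric(x, routes, w, v)
dense_out, dense_loads = expert_centric(x, routes, w, v)
print(scattered_out == dense_out, scattered_loads, dense_loads)  # True 6 4
```

Both orders produce the same outputs; the expert-centric order touches each active expert once (4 loads) instead of once per token-expert pair (6 loads). This gap grows with batch size, since more tokens share each expert.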