FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach

Ning Liao, Xiaoxing Wang, Xiaohan Qin, Junchi Yan

Abstract

As revealed by the scaling law of fine-grained MoE, model performance stops improving once the granularity of the intermediate dimension exceeds an optimal threshold, which limits further gains from single-dimension fine-grained design. To address this bottleneck, we propose FineRMoE (FineR-Grained MoE), an architecture that extends fine-grained expert design to both the intermediate and output dimensions, aiming to enhance expert specialization beyond the single-dimension limit. We further introduce a bi-level sparse forward computation paradigm and a specialized routing mechanism to govern expert activation. In addition, to avoid the prohibitive cost of training FineRMoE from scratch, we devise a generalized upcycling method that builds FineRMoE cost-effectively. Extensive experiments across ten standard benchmarks demonstrate the superior performance of FineRMoE. Compared with the strongest baseline, FineRMoE achieves 6 times higher parameter efficiency, 281 times lower prefill latency, and 136 times higher decoding throughput during inference.

Paper Structure (21 sections, 11 equations, 4 figures, 8 tables, 1 algorithm)

Figures (4)

  • Figure 1: The proposed FineRMoE (FineR-grained MoE) architecture. The MoE layer is composed of a shared expert and multiple sparse experts, in which fine-grained design is applied to both the intermediate and output dimensions. The forward process of the sparse experts consists of a sparse sum layer and a sparse concatenation layer. A single router with a specially designed routing mechanism simultaneously steers the activation in the two sparse layers.
  • Figure 2: The ablation study on the architecture of FineRMoE based on Qwen2.5-1.5B.
  • Figure 3: The forward computation process of a sequence of tokens in the sparse experts of FineRMoE. For a given input sequence, the router first calculates the set of activated experts for each token. The tokens are then permuted to allow for parallel expert computation. After processing by the experts, the outputs are unpermuted to their original token order. For each token, the outputs from its activated experts are combined via a weighted sum to form dimension-reduced components. Finally, these components are concatenated to produce the final, dimension-restored output.
  • Figure 4: The average similarity among the sparse experts across all layers in the effectiveness validation of finer-grained design. The corresponding models are trained based on Qwen2.5-1.5B on 10B tokens for efficiency.
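
The forward process described in the captions of Figures 1 and 3 (route, permute, compute, unpermute, weighted sum, concatenate) can be illustrated with a small sketch. This is not the authors' implementation, and several details are assumptions made only for illustration: each expert is reduced to a single linear map (real experts would be feed-forward blocks with an intermediate dimension), the router applies a softmax within each output-dimension group, and every group activates `top_k` experts, so any additional sparsity in the concatenation layer is not modeled. The names `sparse_forward`, `router_W`, `experts`, and `top_k` are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def sparse_forward(tokens, router_W, experts, top_k):
    """Toy forward pass (illustrative sketch, not the paper's code).

    tokens:   list of d_model-sized vectors
    router_W: (n_groups * per_group) x d_model router matrix (assumed layout)
    experts:  experts[g][e] is a d_part x d_model matrix, d_part = d_model / n_groups
    """
    n_groups, per_group = len(experts), len(experts[0])
    d_part = len(experts[0][0])

    # 1. Routing: one router scores all experts; pick top_k per group (assumption).
    routes = []  # entries: (token index, group, expert, normalised weight)
    for t, x in enumerate(tokens):
        logits = matvec(router_W, x)
        for g in range(n_groups):
            probs = softmax(logits[g * per_group:(g + 1) * per_group])
            chosen = sorted(range(per_group), key=lambda e: -probs[e])[:top_k]
            norm = sum(probs[e] for e in chosen)
            routes.extend((t, g, e, probs[e] / norm) for e in chosen)

    # 2. Permute: order assignments by (group, expert) so each expert's
    #    tokens are contiguous, as they would be for batched computation.
    order = sorted(range(len(routes)), key=lambda i: routes[i][1:3])

    # 3. Expert computation in permuted order.
    permuted = [
        matvec(experts[routes[i][1]][routes[i][2]], tokens[routes[i][0]])
        for i in order
    ]

    # 4. Unpermute: restore the original assignment order.
    outputs = [None] * len(routes)
    for pos, i in enumerate(order):
        outputs[i] = permuted[pos]

    # 5. Weighted sum per (token, group): dimension-reduced components.
    parts = [[[0.0] * d_part for _ in range(n_groups)] for _ in tokens]
    for (t, g, _, w), y in zip(routes, outputs):
        for j, v in enumerate(y):
            parts[t][g][j] += w * v

    # 6. Concatenate the components: dimension-restored output.
    return [[v for part in token_parts for v in part] for token_parts in parts]


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, n_groups, per_group = 8, 2, 4
tokens = rand_matrix(3, d_model)
router_W = rand_matrix(n_groups * per_group, d_model)
experts = [
    [rand_matrix(d_model // n_groups, d_model) for _ in range(per_group)]
    for _ in range(n_groups)
]
out = sparse_forward(tokens, router_W, experts, top_k=2)
print(len(out), len(out[0]))  # one output vector per token, each of size d_model
```

In this sketch each group produces a slice of size `d_model / n_groups`, so concatenating the slices restores the model dimension. The permute and unpermute steps do not change the result; they only mirror the batching order described in the Figure 3 caption.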