Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs

Zhongyang Li, Ziyue Li, Tianyi Zhou

TL;DR

RoMA tackles the MoE routing generalization gap by aligning the routing-weight manifold with the task-embedding manifold through a manifold regularization term applied during lightweight router finetuning. By selectively imitating routing patterns from successful neighbors in the task-embedding space and constraining routing weight similarity across semantically related samples, RoMA bridges the misalignment between task understanding and expert utilization. The method yields 7–15 percentage-point accuracy gains across eight benchmarks on three MoE LLMs while maintaining base-model inference efficiency, and enables small active-parameter MoEs to rival much larger dense models. This geometry-driven approach highlights the importance of task-expert coupling in routing and points to a practical, data-efficient path for improving MoE generalization in large-scale LLMs.

Abstract

Sparse Mixture-of-Experts (MoE) architectures have been widely adopted in recent large language models because they can efficiently scale up model capability without increasing the inference cost. However, evaluations on broad downstream tasks reveal a consistent suboptimality of the routers in existing MoE LLMs, which results in a severe performance gap (e.g., 10-20% in accuracy) relative to the optimal routing. In this paper, we show that aligning the manifold of routing weights with that of the task embedding can effectively reduce the gap and improve MoE LLMs' generalization performance. Our method, "Routing Manifold Alignment (RoMA)", introduces an additional manifold regularization term in the post-training objective and only requires lightweight finetuning of routers (with other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks will share similar expert choices across layers. Building such bindings between tasks and experts over different samples is essential to achieving better generalization. Moreover, RoMA demonstrates the advantage of unifying task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we finetune routers in OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvement brought by RoMA.
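The regularization described in the abstract can be illustrated with a small numerical sketch. The exact formulation is not given in this summary, so everything below is an assumption made for illustration: the function name `manifold_regularizer`, the use of cosine similarity to find neighbors, the softmax weighting of neighbors, and the squared Euclidean distance between routing weights. It only evaluates the penalty on fixed arrays; in actual router finetuning the term would need to be differentiable and added to the task loss.

```python
import numpy as np

def manifold_regularizer(task_emb, routing, success, k=3):
    """Hypothetical sketch: mean weighted squared distance between each
    sample's routing weights and those of its k nearest *successful*
    neighbors in task-embedding space."""
    # Cosine similarity between all pairs of task embeddings.
    unit = task_emb / np.linalg.norm(task_emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Only successful samples may act as neighbors, and a sample is
    # never treated as its own neighbor.
    sim = np.where(success[None, :], sim, -np.inf)
    np.fill_diagonal(sim, -np.inf)

    total = 0.0
    for i in range(len(routing)):
        nbrs = np.argsort(-sim[i])[:k]
        nbrs = nbrs[np.isfinite(sim[i, nbrs])]  # drop masked-out candidates
        if nbrs.size == 0:
            continue  # no successful neighbor available for this sample
        # Assumed weighting: softmax over neighbor similarities.
        w = np.exp(sim[i, nbrs] - sim[i, nbrs].max())
        w /= w.sum()
        sq_dist = np.sum((routing[i] - routing[nbrs]) ** 2, axis=1)
        total += float(w @ sq_dist)
    return total / len(routing)

# Toy data: 6 samples, 4-dim task embeddings, routing weights over 8 experts.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))
routing = rng.dirichlet(np.ones(8), size=6)
success = np.array([True, False, True, True, False, True])
loss = manifold_regularizer(emb, routing, success, k=2)
```

The penalty is zero when every sample already shares the routing weights of its successful neighbors and grows as routing diverges from them, which is the behavior the abstract attributes to the regularizer. A real implementation would also have to handle routing weights from multiple layers, which this sketch flattens into a single vector per sample.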

Paper Structure

This paper contains 28 sections, 9 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: RoMA on OLMoE-7B-A1B vs. 7-34B dense LLMs across eight benchmarks. RoMA leads to 7-15% accuracy improvement, consistently outperforming all models over eight benchmarks, demonstrating the effectiveness of post-training by RoMA.
  • Figure 2: Overview of RoMA. RoMA finetunes routers in MoE LLM (bottom, yellow) with a training objective defined on each sample $(x_i, y_i)$, which is composed of (1) the task loss $\mathcal{L}_{\text{task}}(i)$ defined on the model output $f(x_i,r_i)$; and (2) the manifold alignment regularization $\mathcal{L}_{\text{manifold}}(i)$, which aligns the manifolds of routing weights (right, green) and the task embedding (left, blue). It improves MoE's generalization by unifying solution generation in MoE with task understanding.
  • Figure 3: UMAP visualization of the task-embedding and routing-weight manifolds for samples in ARC-C. (a) The task embeddings show cluster structures. (b) Routing weights from the pretrained MoE are scattered and misaligned with the task-embedding clusters. (c) RoMA aligns routing weights with the cluster structure of the task-embedding manifold. (d) RoMA also achieves a manifold structure similar to that of the optimal routing weights (oracle), which explains the improvement in generalization.
  • Figure 4: Training set statistics.
  • Figure 5: Performance and inference cost of OLMoE (base model), OLMoE + C3PO, and OLMoE + RoMA across eight benchmarks. (a) Accuracy: RoMA consistently improves the base model's performance to a level comparable to or better than that of C3PO. (b) Inference cost (in FLOPs $\times 10^{11}$): RoMA maintains nearly the same efficiency as the base model, while C3PO requires test-time optimization and incurs $6$--$7\times$ more FLOPs. These results highlight the effectiveness and efficiency of RoMA.
  • ...and 10 more figures