Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs
Zhongyang Li, Ziyue Li, Tianyi Zhou
TL;DR
RoMA tackles the MoE routing generalization gap by aligning the routing-weight manifold with the task-embedding manifold through a manifold regularization term applied during lightweight router finetuning. By selectively imitating routing patterns from successful neighbors in the task-embedding space and constraining routing weight similarity across semantically related samples, RoMA bridges the misalignment between task understanding and expert utilization. The method yields 7–15 percentage-point accuracy gains across eight benchmarks on three MoE LLMs while maintaining base-model inference efficiency, and enables small active-parameter MoEs to rival much larger dense models. This geometry-driven approach highlights the importance of task-expert coupling in routing and points to a practical, data-efficient path for improving MoE generalization in large-scale LLMs.
Abstract
Sparse Mixture-of-Experts (MoE) architectures have been widely adopted in recent large language models because they can efficiently scale up model capability without increasing inference cost. However, evaluations on a broad range of downstream tasks reveal that the routers in existing MoE LLMs are consistently suboptimal, which results in a severe performance gap (e.g., 10-20% in accuracy) relative to optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embeddings can effectively reduce this gap and improve the generalization performance of MoE LLMs. Our method, "Routing Manifold Alignment (RoMA)", introduces an additional manifold regularization term in the post-training objective and requires only lightweight finetuning of the routers (with all other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its successful neighbors (samples whose routing weights lead to correct answers) in a task embedding space. Consequently, samples targeting similar tasks share similar expert choices across layers. Building such bindings between tasks and experts across different samples is essential for better generalization. Moreover, RoMA demonstrates the advantage of unifying task understanding (by embedding models) with solution generation (by MoE LLMs). In experiments, we finetune the routers of OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA. Evaluations on diverse benchmarks and extensive comparisons with baselines show the substantial improvement brought by RoMA.
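
The regularization idea described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `manifold_regularizer`, the use of cosine similarity, uniform weighting over the k nearest successful neighbors, and the squared Euclidean distance between routing weights are all assumptions chosen for illustration, and the sketch covers a single layer's routing weights rather than all layers.

```python
import numpy as np

def manifold_regularizer(routing, embeddings, success, k=2):
    """Hypothetical sketch: mean squared distance between each sample's
    routing weights and those of its k nearest *successful* neighbors
    in the task-embedding space (cosine similarity, uniform weights)."""
    routing = np.asarray(routing, dtype=float)      # (N, num_experts)
    emb = np.asarray(embeddings, dtype=float)       # (N, dim), assumed nonzero rows
    success = np.asarray(success, dtype=bool)       # (N,), True if answer was correct

    # Cosine similarity between all pairs of task embeddings.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T

    # A sample is never its own neighbor, and unsuccessful samples
    # are never used as imitation targets.
    np.fill_diagonal(sim, -np.inf)
    sim[:, ~success] = -np.inf

    total, count = 0.0, 0
    for i in range(len(routing)):
        order = np.argsort(-sim[i])[:k]
        neighbors = [j for j in order if np.isfinite(sim[i, j])]
        if not neighbors:
            continue  # no successful neighbor available for this sample
        diffs = routing[neighbors] - routing[i]
        total += np.mean(np.sum(diffs ** 2, axis=1))
        count += 1
    return total / max(count, 1)

# Toy usage: samples 0 and 1 are similar tasks with identical routing;
# sample 2 failed and routes differently from its nearest successful neighbor.
routing = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
success = [True, True, False]
penalty = manifold_regularizer(routing, embeddings, success, k=1)
```

In an actual finetuning setup, a term of this kind would be added to the training loss and differentiated with respect to the router parameters only, which would call for an autodiff framework rather than plain NumPy.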
