DirMoE: Dirichlet-routed Mixture of Experts

Amirhossein Vahidi; Hesam Asadollahzadeh; Navid Akhavan Attar; Marie Moullet; Kevin Ly; Xingyi Yang; Mohammad Lotfollahi

TL;DR

DirMoE introduces a fully differentiable Dirichlet-based router for Mixture-of-Experts that separates per-token expert selection from per-expert contribution. It uses a spike-and-slab factorization on the simplex, with a Gumbel-Sigmoid gate for activation and an implicit reparameterization of the Dirichlet for mass allocation, enabling end-to-end gradients and explicit sparsity control through a sparsity knob $\lambda$. The approach yields calibrated, sparse routing without balancing losses, achieving competitive zero-shot performance and improved expert specialization on large-scale language tasks with a 185M-parameter backbone. Empirically, DirMoE demonstrates scalable training, controllable sparsity via the Dirichlet concentration, and robust specialization across domains when pretrained on The Pile, making it practical for large sparse MoE deployments.

Abstract

Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on non-differentiable Top-$k$+Softmax, limiting their performance and scalability. We argue that two distinct decisions, which experts to activate and how to distribute expert contributions among them, are conflated in standard Top-$k$+Softmax. We introduce Dirichlet-Routed MoE (DirMoE), a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework. This design fundamentally disentangles the core routing problems: expert selection, modeled by a Bernoulli component, and expert contribution among chosen experts, handled by a Dirichlet component. The entire forward pass remains fully differentiable through the use of Gumbel-Sigmoid relaxation for the expert selection and implicit reparameterization for the Dirichlet distribution. Our training objective, a variational ELBO, includes a direct sparsity penalty that precisely controls the number of active experts in expectation, alongside a schedule for key hyperparameters that guides the model from an exploratory to a definitive routing state. Moreover, our DirMoE router matches or exceeds other methods while improving expert specialization.
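The two routing decisions described above can be illustrated with a forward-only sketch for a single token. It is not the authors' implementation. The function names, the default temperature `tau`, and in particular the rule that mixes the active and inactive concentrations by the relaxed gate are assumptions made for illustration. Gradients (the implicit reparameterization of the Dirichlet) are not shown, since the standard library has no automatic differentiation.

```python
import math
import random


def gumbel_sigmoid(logit, tau, rng):
    """Relaxed Bernoulli gate: sigmoid((logit + logistic noise) / tau)."""
    # Clamp the uniform draw so both logarithms below stay finite.
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    noise = math.log(u) - math.log1p(-u)  # Logistic(0, 1) sample
    return 1.0 / (1.0 + math.exp(-(logit + noise) / tau))


def route_token(gate_logits, alpha_hi, alpha_lo, tau=0.5, rng=random):
    """Return (z, theta, r) for one token; all names here are hypothetical."""
    # Expert selection: soft gates z_i in [0, 1].
    z = [gumbel_sigmoid(l, tau, rng) for l in gate_logits]
    # ASSUMPTION: interpolate between active and inactive concentrations.
    alpha = [zi * hi + (1.0 - zi) * lo
             for zi, hi, lo in zip(z, alpha_hi, alpha_lo)]
    # Expert contribution: a Dirichlet draw via normalized Gamma variates.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    theta = [gi / total for gi in g]
    # Routing probabilities: normalized product of selection and contribution.
    prod = [zi * ti for zi, ti in zip(z, theta)]
    norm = sum(prod)
    return z, theta, [p / norm for p in prod]


if __name__ == "__main__":
    z, theta, r = route_token([2.0, -1.0, 0.5, -3.0], [5.0] * 4, [0.1] * 4,
                              rng=random.Random(0))
    print([round(p, 3) for p in r])
```

Lowering `tau` pushes the gates toward 0 or 1, so the routing weights concentrate on fewer experts. Very small values of `tau` can overflow `math.exp` in this naive sigmoid.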

Paper Structure (50 sections, 3 theorems, 34 equations, 9 figures, 4 tables, 1 algorithm)

Introduction
Our contributions are:
Related Work
Mixture-of-Experts and Routing
Sparsity in MoE
Preliminary
Spike and Slab Routing on the Simplex
Simpson Index as a Sparsity Metric
Method
Problem Setup
Differentiable router
Training Objective
Sparsity regularization.
Scheduler
Temperature schedule.
...and 35 more sections

Key Result

Lemma 1

Let $\mathbf p \sim \mathrm{Dir}(\lambda\,\boldsymbol{\beta})$ with $\boldsymbol{\beta}\in\mathbb{R}_{>0}^E$, $B=\sum_{i=1}^E \beta_i$, and $S_2=\sum_{i=1}^E \beta_i^2$. Then
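The lemma's displayed equation is missing from this extract. As a hedged reconstruction, not the paper's typeset statement: the standard Dirichlet second moment $\mathbb{E}[p_i^2] = \alpha_i(\alpha_i+1)/(\alpha_0(\alpha_0+1))$ with $\alpha_i=\lambda\beta_i$ and $\alpha_0=\lambda B$ gives an expected Simpson index $\mathbb{E}\big[\sum_i p_i^2\big] = (\lambda S_2 + B)/(B(\lambda B + 1))$. The sketch below compares that closed form against a Monte Carlo estimate. The function names are hypothetical.

```python
import random


def expected_simpson(beta, lam):
    """Closed form E[sum_i p_i^2] for p ~ Dir(lam * beta), from Dirichlet moments."""
    B = sum(beta)
    S2 = sum(b * b for b in beta)
    return (lam * S2 + B) / (B * (lam * B + 1.0))


def simpson_monte_carlo(beta, lam, n=20000, seed=0):
    """Estimate E[sum_i p_i^2] by sampling Dir(lam * beta) via Gamma variates."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        g = [rng.gammavariate(lam * b, 1.0) for b in beta]
        s = sum(g)
        total += sum((gi / s) ** 2 for gi in g)
    return total / n


if __name__ == "__main__":
    beta = [1.0, 1.0, 1.0, 1.0]
    for lam in (0.5, 1.0, 2.0, 8.0):
        print(lam, expected_simpson(beta, lam), simpson_monte_carlo(beta, lam))
```

Under this formula the expected Simpson index decreases in $\lambda$ whenever $E \ge 2$, from 1 as $\lambda \to 0$ toward $S_2/B^2$ as $\lambda \to \infty$. That behavior is consistent with the title of Theorem 1 (monotone sparsity control by concentration), although the theorem's exact statement is not part of this extract.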

Figures (9)

Figure 1: Illustration of DirMoE. Given a batch $\mathcal{X}$ of input embeddings, two different heads $\alpha_{\mathrm{hi}}(x),\alpha_{\mathrm{lo}}(x)$ learn the active and inactive per-token expert concentrations, and $\ell(x)$ learns the gating logits. The routing probabilities are the normalized product of the expert selection $z$ and the Dirichlet probabilities $\theta$ (expert contribution).
Figure 2: Effect of sparsity regularization on sparsity: with lower $\lambda_{sparsity}$, the achieved sparsity falls short of the desired sparsity.
Figure 3: Effect of $m$ and $\lambda$ on (a) Simpson index and (b) sparsity.
Figure 4: Effect of active experts ($k$) on (a) LLM loss and (b) sparsity; effect of number of experts (E) on (c) LLM loss.
Figure 5: Domain specialization of (a) DirMoE and (b) Vanilla MoE across layers and domains, based on average routed tokens. The grey dashed line indicates the uniform distribution.
...and 4 more figures

Theorems & Definitions (5)

Lemma 1: Expected Simpson index under Dirichlet
Theorem 1: Monotone sparsity control by concentration
Corollary 1: Symmetric base
Proof: Proof sketch
Proof
