DirMoE: Dirichlet-routed Mixture of Experts

Amirhossein Vahidi, Hesam Asadollahzadeh, Navid Akhavan Attar, Marie Moullet, Kevin Ly, Xingyi Yang, Mohammad Lotfollahi

TL;DR

DirMoE introduces a fully differentiable Dirichlet-based router for Mixture-of-Experts that separates per-token expert selection from per-expert contribution. It uses a spike-and-slab factorization on the simplex, with a Gumbel-Sigmoid gate for activation and an implicitly reparameterized Dirichlet for mass allocation, which enables end-to-end gradients and explicit sparsity control through a sparsity knob $\lambda$. The approach yields calibrated, sparse routing without balancing losses, achieving competitive zero-shot performance and improved expert specialization on large-scale language tasks with a 185M-parameter backbone. Empirically, DirMoE demonstrates scalable training, controllable sparsity via the Dirichlet concentration, and robust specialization across domains when pretrained on The Pile, making it practical for large sparse MoE deployments.

Abstract

Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on non-differentiable Top-$k$+Softmax, limiting their performance and scalability. We argue that two distinct decisions, which experts to activate and how to distribute expert contributions among them, are conflated in standard Top-$k$+Softmax. We introduce Dirichlet-Routed MoE (DirMoE), a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework. This design fundamentally disentangles the core routing problems: expert selection, modeled by a Bernoulli component, and expert contribution among chosen experts, handled by a Dirichlet component. The entire forward pass remains fully differentiable through the use of Gumbel-Sigmoid relaxation for the expert selection and implicit reparameterization for the Dirichlet distribution. Our training objective, a variational ELBO, includes a direct sparsity penalty that precisely controls the number of active experts in expectation, alongside a schedule for key hyperparameters that guides the model from an exploratory to a definitive routing state. Moreover, our DirMoE router matches or exceeds other methods while improving expert specialization.
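
The Gumbel-Sigmoid relaxation named in the abstract can be illustrated with a minimal NumPy sketch. It is forward-only and is not the authors' implementation: the function name `gumbel_sigmoid`, the temperature values, and the example logits are assumptions made for illustration. It shows how a temperature `tau` moves the gates from soft, exploratory values toward nearly binary decisions, and how the expected number of active experts follows from the gate probabilities.

```python
import numpy as np

def gumbel_sigmoid(logits, tau, rng):
    """Relaxed Bernoulli sample in (0, 1): sigmoid((logits + noise) / tau)."""
    # The difference of two Gumbel(0, 1) draws is Logistic(0, 1) noise.
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=logits.shape)
    noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(logits + noise) / tau))

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.0, -2.0, -4.0])  # hypothetical gate logits, 4 experts

soft = gumbel_sigmoid(logits, tau=1.0, rng=rng)    # smooth, exploratory gates
sharp = gumbel_sigmoid(logits, tau=0.05, rng=rng)  # nearly binary gates

# Expected number of active experts under the underlying Bernoulli gates.
expected_active = (1.0 / (1.0 + np.exp(-logits))).sum()
```

A sparsity penalty of the kind the abstract describes would act on a quantity like `expected_active`; its exact form is not given in this excerpt, so none is assumed here.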

Paper Structure

This paper contains 50 sections, 3 theorems, 34 equations, 9 figures, 4 tables, and 1 algorithm.

Key Result

Lemma 1

Let $\mathbf p \sim \mathrm{Dir}(\lambda\,\boldsymbol{\beta})$ with $\boldsymbol{\beta}\in\mathbb{R}_{>0}^E$, $B=\sum_{i=1}^E \beta_i$, and $S_2=\sum_{i=1}^E \beta_i^2$. Then
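
The conclusion of the lemma is not reproduced in this excerpt. As a hedged sketch, assuming the lemma concerns the Simpson index $\sum_{i=1}^E p_i^2$ (as its title in the list of theorems suggests), the standard Dirichlet second moment $\mathbb{E}[p_i^2] = \frac{\alpha_i(\alpha_i+1)}{\alpha_0(\alpha_0+1)}$ with $\alpha_i = \lambda\beta_i$ and $\alpha_0 = \lambda B$ gives $\mathbb{E}\big[\sum_i p_i^2\big] = \frac{\lambda S_2 + B}{B(\lambda B + 1)}$. The snippet below checks this closed form against a Monte Carlo estimate; the helper name `expected_simpson` and the example values of $\lambda$ and $\boldsymbol{\beta}$ are illustrative assumptions.

```python
import numpy as np

def expected_simpson(lam, beta):
    """Closed-form E[sum_i p_i^2] for p ~ Dir(lam * beta), from Dirichlet moments."""
    beta = np.asarray(beta, dtype=float)
    B = beta.sum()
    S2 = np.square(beta).sum()
    return (lam * S2 + B) / (B * (lam * B + 1.0))

rng = np.random.default_rng(0)
beta = np.array([0.5, 1.0, 1.5, 2.0])  # hypothetical base concentrations
lam = 0.7                              # hypothetical concentration scale

# Monte Carlo estimate of the Simpson index from Dirichlet samples.
samples = rng.dirichlet(lam * beta, size=200_000)
mc = np.square(samples).sum(axis=1).mean()
closed = expected_simpson(lam, beta)
```

Because $B^2 \ge S_2$ for positive $\boldsymbol{\beta}$, this expression is non-increasing in $\lambda$: smaller $\lambda$ concentrates mass on fewer experts (a higher Simpson index), while larger $\lambda$ spreads it out.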

Figures (9)

  • Figure 1: Illustration of DirMoE. Given a batch $\mathcal{X}$ of input embeddings, two heads $\alpha_{\mathrm{hi}}(x),\alpha_{\mathrm{lo}}(x)$ learn the active and inactive per-token expert concentrations, and $\ell(x)$ learns the gating logits. The routing probabilities are the normalized product of the expert-selection gates $z$ and the Dirichlet probabilities $\theta$ (expert contribution).
  • Figure 2: Effect of sparsity regularization on sparsity: with lower $\lambda_{sparsity}$, the achieved sparsity falls short of the desired sparsity.
  • Figure 3: Effect of $m$ and $\lambda$ on (a) Simpson index and (b) sparsity.
  • Figure 4: Effect of the number of active experts ($k$) on (a) LLM loss and (b) sparsity; effect of the number of experts ($E$) on (c) LLM loss.
  • Figure 5: Domain specialization of (a) DirMoE and (b) Vanilla MoE across layers and domains, based on average routed tokens. The grey dashed line indicates the uniform distribution.
  • ...and 4 more figures
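
The routing rule in the Figure 1 caption (routing probabilities as the normalized product of $z$ and $\theta$) can be sketched in NumPy as follows. This is a forward-only illustration under stated assumptions, not the paper's implementation: the linear heads `w_gate`, `w_hi`, `w_lo`, the softplus and sigmoid parameterizations of the concentrations, and in particular the interpolation `alpha = z * alpha_hi + (1 - z) * alpha_lo` are hypothetical choices, since the excerpt does not specify how the two concentration heads are combined.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.logaddexp(0.0, x)

def dirmoe_route(x, w_gate, w_hi, w_lo, tau, rng):
    """Forward-only sketch: returns (z, theta, p) for a batch of tokens."""
    # Gating logits l(x) and relaxed Bernoulli selection z (Gumbel-Sigmoid).
    logits = x @ w_gate
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=logits.shape)
    z = sigmoid((logits + np.log(u) - np.log1p(-u)) / tau)

    # ASSUMPTION: positive concentrations; the "inactive" head is kept small
    # so that unselected experts receive little Dirichlet mass.
    alpha_hi = 1.0 + softplus(x @ w_hi)
    alpha_lo = 0.05 + 0.1 * sigmoid(x @ w_lo)

    # ASSUMPTION: the gate interpolates between the two concentrations.
    alpha = z * alpha_hi + (1.0 - z) * alpha_lo

    # Dirichlet sample via normalized Gamma draws (NumPy provides no gradients).
    g = rng.gamma(shape=alpha)
    theta = g / g.sum(axis=-1, keepdims=True)

    # Routing probabilities: normalized product of z and theta.
    zt = z * theta
    p = zt / zt.sum(axis=-1, keepdims=True)
    return z, theta, p

rng = np.random.default_rng(0)
tokens, d_model, n_experts = 4, 8, 6  # hypothetical sizes
x = rng.normal(size=(tokens, d_model))
w_gate, w_hi, w_lo = (rng.normal(size=(d_model, n_experts)) for _ in range(3))
z, theta, p = dirmoe_route(x, w_gate, w_hi, w_lo, tau=0.5, rng=rng)
```

Training end to end would require an autodiff framework with a reparameterized Dirichlet or Gamma sampler in place of `rng.gamma`; the NumPy version only shows how the pieces compose in the forward pass.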

Theorems & Definitions (5)

  • Lemma 1: Expected Simpson index under Dirichlet
  • Theorem 1: Monotone sparsity control by concentration
  • Corollary 1: Symmetric base
  • proof: Proof sketch
  • proof