Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs

Leyla Mirvakhabova, Babak Ehteshami Bejnordi, Gaurav Kumar, Hanxue Liang, Wanru Zhao, Paul Whatmough

TL;DR

This work tackles poor expert specialization when upcycling dense models into sparse Mixture-of-Experts by introducing Dirichlet-Prior Shaping Loss (DPSL), a router regularizer that aligns categorical routing with a target Dirichlet prior via per-category Beta marginals $p_k\sim\mathrm{Beta}(\alpha_k, A-\alpha_k)$, where $A=\sum_j \alpha_j$ is the total concentration of the prior. By selecting symmetric or asymmetric priors, DPSL enables balanced or modality-/task-specific specialization without manual pretraining, and it extends to any module producing categorical distributions. Experiments on upcycled Vision-Language MoEs with Qwen2, Phi3, and Llama3.2 backbones show consistent downstream gains across six benchmarks over upcycling baselines, alternative regularizers, and BTX-style approaches. The results indicate that DPSL fosters more adaptive, end-to-end specialized routing, offering a practical and general route to higher-capacity, efficient multimodal models.
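The Beta-marginal statement above is a standard property of the Dirichlet distribution and can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (`sample_dirichlet`, `beta_marginal_moments`, `empirical_moments`) are hypothetical and not taken from the paper, and the concentration vector is the asymmetric example listed in the Figure 4 caption.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def beta_marginal_moments(alphas, k):
    """Mean and variance of Beta(alpha_k, A - alpha_k), with A = sum(alphas)."""
    a = alphas[k]
    b = sum(alphas) - a
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

def empirical_moments(values):
    """Sample mean and (biased) sample variance of a list of floats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

# Illustrative check: each coordinate of the Dirichlet samples should have
# moments close to those of its Beta marginal.
rng = random.Random(0)
alphas = [0.75, 0.1, 1.25]  # asymmetric prior from the Figure 4 caption
samples = [sample_dirichlet(alphas, rng) for _ in range(20000)]
for k in range(len(alphas)):
    theory = beta_marginal_moments(alphas, k)
    observed = empirical_moments([s[k] for s in samples])
    print(f"expert {k}: Beta moments {theory}, sample moments {observed}")
```

Because each expert's routing probability has a one-dimensional Beta target, a regularizer can compare marginals expert by expert instead of working with the full joint distribution on the simplex.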

Abstract

Upcycling pre-trained dense models into sparse Mixture-of-Experts (MoEs) efficiently increases model capacity but often suffers from poor expert specialization due to naive weight replication. Our analysis reveals that upcycled MoEs, even with conventional regularization, exhibit low-confidence, weakly differentiated routing, hindering performance. We introduce Dirichlet-Prior Shaping Loss (DPSL), a novel router regularization technique that directly shapes routing probability distributions by matching expert assignments to a target Dirichlet prior. DPSL offers fine-grained control over expert balance and specialization, and enables encoding of inductive biases such as encouraging experts to focus on specific modalities or tasks, without requiring manual intervention; notably, DPSL is a general tool applicable to any module that outputs categorical probability distributions, extending its utility beyond MoE training. Experiments on upcycled MoE vision-language models (with Qwen2, Phi3, Llama3.2 LLM backbones) show DPSL consistently outperforms upcycling strategies and regularization techniques across standard vision-language benchmarks, addressing the critical issue of poor specialization and fostering more adaptive, higher-performing models.
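One way to picture "matching expert assignments to a target Dirichlet prior" is to compare, for each expert, the empirical CDF of its routing probabilities with the CDF of the corresponding Beta marginal. The sketch below is an assumption-laden illustration, not the paper's loss: the exact DPSL formula is not given in this section, so a Cramér-von Mises-style mean squared CDF gap is swapped in. All function names are hypothetical, the Beta CDF is approximated by a simple midpoint rule (adequate for concentrations of at least 1, crude below that), and a training-time version would need a differentiable implementation in an autodiff framework.

```python
import math
import random

def beta_cdf(x, a, b, steps=400):
    """Approximate the Beta(a, b) CDF at x by midpoint-rule integration of the density."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log1p(-t))
    return min(1.0, total * h)

def cdf_matching_penalty(probs, alphas):
    """Mean squared gap between empirical and Beta-marginal CDFs (illustrative, not DPSL itself).

    probs: list of per-token routing distributions, each a list summing to 1.
    alphas: Dirichlet concentration parameters, one per expert.
    """
    n = len(probs)
    total_alpha = sum(alphas)
    penalty = 0.0
    for k, a_k in enumerate(alphas):
        column = sorted(p[k] for p in probs)
        for i, p in enumerate(column):
            target = beta_cdf(p, a_k, total_alpha - a_k)
            empirical = (i + 0.5) / n  # midpoint plotting position of the i-th order statistic
            penalty += (target - empirical) ** 2
    return penalty / (n * len(alphas))

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
alphas = [1.0, 1.0, 1.0]  # symmetric prior
collapsed = [[1 / 3, 1 / 3, 1 / 3] for _ in range(200)]  # identical experts: uniform routing
prior_like = [sample_dirichlet(alphas, rng) for _ in range(200)]
print("collapsed router penalty: ", cdf_matching_penalty(collapsed, alphas))
print("prior-like router penalty:", cdf_matching_penalty(prior_like, alphas))
```

Under these assumptions, the collapsed router, which mimics the homogeneous routing of freshly upcycled experts, scores a noticeably larger penalty than routing probabilities that already follow the prior.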

Paper Structure

This paper contains 37 sections, 13 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Sparse upcycling (left) initializes identical experts, yielding homogeneous routing probabilities and limited specialization. (right) Our proposed Dirichlet-Prior Shaping Loss guides routing towards desired distributions fostering balanced and confident selection (via symmetric priors) or targeted, modality-/task-aware specialization (via asymmetric priors).
  • Figure 2: Dirichlet-Prior Shaping Loss (DPSL) shapes categorical probability distributions from two data sources (S1, S2). Top and middle rows show the empirical (dashed) vs. target (solid) CDFs for each category, at initialization and after convergence, respectively, along with simplex of assignment probabilities. Bottom row presents data histograms of assignment probabilities overlaid with target Beta PDFs, and learning curves showing DPSL minimization during training.
  • Figure 3: Router output distributions for three experts in an upcycled MoE with top-1 routing. Each panel shows the simplex of routing probabilities under (a) no regularization, (b) z-loss, (c) load-balancing loss, and (d) Dirichlet-Prior Shaping Loss (symmetric prior).
  • Figure 4: Visualization of the marginal Beta distributions for the following Dirichlet distributions: $\operatorname{Dir}(5.0, 5.0, 5.0)$, $\operatorname{Dir}(0.2, 0.2, 0.2)$, $\operatorname{Dir}(1.0, 1.0, 1.0)$, and $\operatorname{Dir}(0.75, 0.1, 1.25)$.
  • Figure 5: Dirichlet-Prior Shaping Loss (DPSL) shapes categorical probability distributions from two data sources (S1, S2). Top row shows the empirical (dashed) vs. target (solid) CDFs for each category after convergence, along with simplex of assignment probabilities. Bottom row presents data histograms of assignment probabilities overlaid with target Beta PDFs, and learning curves showing DPSL minimization during training.
  • ...and 1 more figure