Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs
Leyla Mirvakhabova, Babak Ehteshami Bejnordi, Gaurav Kumar, Hanxue Liang, Wanru Zhao, Paul Whatmough
TL;DR
This work tackles poor expert specialization when upcycling dense models into sparse Mixture-of-Experts (MoEs) by introducing Dirichlet-Prior Shaping Loss (DPSL), a router regularizer that aligns categorical routing probabilities with a target Dirichlet prior via its per-category Beta marginals $p_k\sim\mathrm{Beta}(\alpha_k, A-\alpha_k)$, where $A=\sum_j \alpha_j$ is the total concentration. By selecting symmetric or asymmetric priors, DPSL enables balanced or modality-/task-specific specialization without manual pretraining, and it extends to any module producing categorical distributions. Empirical validation on upcycled vision-language MoEs with Qwen2, Phi3, and Llama3.2 backbones shows consistently better downstream performance across six benchmarks than upcycling baselines, other regularizers, and BTX-style approaches. The results suggest that DPSL fosters more adaptive, end-to-end specialized routing, providing a practical, generalizable route to higher-capacity, efficient multimodal models.
Abstract
Upcycling pre-trained dense models into sparse Mixture-of-Experts (MoEs) efficiently increases model capacity but often suffers from poor expert specialization due to naive weight replication. Our analysis reveals that upcycled MoEs, even with conventional regularization, exhibit low-confidence, weakly differentiated routing, which hinders performance. We introduce Dirichlet-Prior Shaping Loss (DPSL), a novel router regularization technique that directly shapes routing probability distributions by matching expert assignments to a target Dirichlet prior. DPSL offers fine-grained control over expert balance and specialization, and enables encoding of inductive biases, such as encouraging experts to focus on specific modalities or tasks, without requiring manual intervention. Notably, DPSL is a general tool applicable to any module that outputs categorical probability distributions, extending its utility beyond MoE training. Experiments on upcycled MoE vision-language models (with Qwen2, Phi3, and Llama3.2 LLM backbones) show that DPSL consistently outperforms existing upcycling strategies and regularization techniques across standard vision-language benchmarks, addressing the critical issue of poor specialization and fostering more adaptive, higher-performing models.
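The Beta marginals mentioned in the TL;DR follow from a standard property of the Dirichlet distribution: if $p\sim\mathrm{Dir}(\alpha)$ with $A=\sum_j\alpha_j$, then each coordinate satisfies $p_k\sim\mathrm{Beta}(\alpha_k, A-\alpha_k)$. The sketch below is not the DPSL loss itself. It only checks this property numerically for a symmetric prior (balanced experts) and an asymmetric one (some experts favored). The function names and the example concentration values are hypothetical and chosen purely for illustration.

```python
import numpy as np


def beta_marginal_moments(alpha):
    """Analytic mean and variance of each marginal p_k ~ Beta(alpha_k, A - alpha_k)."""
    alpha = np.asarray(alpha, dtype=float)
    total = alpha.sum()  # A, the total concentration
    mean = alpha / total
    var = alpha * (total - alpha) / (total**2 * (total + 1.0))
    return mean, var


def empirical_moments(alpha, n_samples=100_000, seed=0):
    """Per-coordinate mean and variance of samples drawn from Dir(alpha)."""
    rng = np.random.default_rng(seed)
    samples = rng.dirichlet(alpha, size=n_samples)  # shape (n_samples, K)
    return samples.mean(axis=0), samples.var(axis=0)


# Hypothetical priors over four experts (illustrative values only).
symmetric_alpha = [1.0, 1.0, 1.0, 1.0]    # every expert equally likely a priori
asymmetric_alpha = [4.0, 1.0, 1.0, 0.5]   # first expert favored a priori

for name, alpha in [("symmetric", symmetric_alpha), ("asymmetric", asymmetric_alpha)]:
    analytic_mean, analytic_var = beta_marginal_moments(alpha)
    sample_mean, sample_var = empirical_moments(alpha)
    print(name, "analytic mean:", analytic_mean, "sample mean:", sample_mean.round(3))
    print(name, "analytic var: ", analytic_var, "sample var: ", sample_var.round(4))
```

Under a symmetric prior every expert has the same expected routing share, while an asymmetric prior shifts the expected share toward the experts with larger $\alpha_k$. This is the kind of target a prior-matching regularizer could steer the router toward.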
