On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions
Huy Nguyen, Xing Han, Carl Harris, Suchi Saria, Nhat Ho
TL;DR
The paper analyzes hierarchical MoE (HMoE) models under three gating configurations: Softmax at both levels (SS), Softmax at the outer level with Laplace at the inner level (SL), and Laplace at both levels (LL). It shows that Laplace gating, especially in the LL configuration, eliminates key parameter interactions responsible for slow convergence in Softmax-based gates. The paper establishes density-estimation rates in Hellinger distance and derives Voronoi-loss-based bounds that characterize the convergence of parameter and expert estimation, revealing that LL gating provides the fastest and most robust expert specialization. Experiments on multimodal and multi-domain tasks, including MIMIC-based clinical fusion and CMU-MOSI sentiment analysis, show that Laplace-based gating consistently improves performance and routing diversity, with LL often achieving the best results. The work offers theoretical and empirical guidance for choosing gating functions in HMoE, highlights practical gains on complex, heterogeneous data, and outlines future directions in model order estimation and broader exploration of gating functions.
Abstract
With the growing prominence of the Mixture of Experts (MoE) architecture in developing large-scale foundation models, we investigate the Hierarchical Mixture of Experts (HMoE), a specialized variant of MoE that excels at handling complex inputs and improving performance on targeted tasks. Our analysis highlights the advantages of using the Laplace gating function over the traditional Softmax gating within the HMoE framework. We theoretically demonstrate that applying the Laplace gating function at both levels of the HMoE model helps eliminate undesirable parameter interactions caused by Softmax gating and, therefore, accelerates expert convergence and enhances expert specialization. Empirical validation across diverse scenarios supports these theoretical claims. These scenarios include large-scale multimodal tasks, image classification, and latent domain discovery and prediction tasks, where our modified HMoE models show substantial performance improvements over conventional HMoE models.
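To make the gating configurations concrete, the following is a minimal NumPy sketch of a two-level HMoE forward pass, not the authors' implementation. It assumes the Laplace gate assigns weights proportional to exp(-||W_i - x||_2) and the Softmax gate weights proportional to exp(W_i · x + b_i); the function names, the linear experts, and the dimensions are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax_gate(x, W, b):
    """Softmax gating: weights proportional to exp(W_i . x + b_i)."""
    logits = W @ x + b
    logits = logits - logits.max()  # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def laplace_gate(x, W):
    """Laplace gating (assumed form): weights proportional to exp(-||W_i - x||_2)."""
    scores = -np.linalg.norm(W - x, axis=1)
    scores = scores - scores.max()  # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()

def hmoe_predict(x, outer_gate, inner_gates, experts):
    """Two-level mixture: sum_i g_i(x) * sum_j g_{j|i}(x) * expert_ij(x)."""
    out = 0.0
    for i, g_i in enumerate(outer_gate(x)):
        for j, g_ij in enumerate(inner_gates[i](x)):
            out += g_i * g_ij * experts[i][j](x)
    return out

# Hypothetical toy setup: 2 outer groups, 2 inner experts each, inputs in R^3.
rng = np.random.default_rng(0)
d, k1, k2 = 3, 2, 2
x = rng.normal(size=d)
W_outer = rng.normal(size=(k1, d))
b_outer = np.zeros(k1)
W_inner = rng.normal(size=(k1, k2, d))
A = rng.normal(size=(k1, k2, d))  # parameters of illustrative linear experts

experts = [[(lambda x, a=A[i, j]: float(a @ x)) for j in range(k2)]
           for i in range(k1)]
inner_laplace = [(lambda x, Wi=W_inner[i]: laplace_gate(x, Wi))
                 for i in range(k1)]

# LL: Laplace gating at both levels.
y_ll = hmoe_predict(x, lambda x: laplace_gate(x, W_outer), inner_laplace, experts)

# SL: Softmax at the outer level, Laplace at the inner level.
y_sl = hmoe_predict(x, lambda x: softmax_gate(x, W_outer, b_outer),
                    inner_laplace, experts)
```

In this sketch the Laplace gate depends on the input only through distances to the gating parameters, while the Softmax gate couples a weight vector and a bias through an inner product; the parameter interactions the abstract refers to arise in the Softmax case.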
