On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions

Huy Nguyen; Xing Han; Carl Harris; Suchi Saria; Nhat Ho

TL;DR

The paper analyzes hierarchical MoE (HMoE) models under three two-level gating configurations: Softmax-Softmax (SS), Softmax-Laplace (SL), and Laplace-Laplace (LL). It shows that Laplace gating, especially when used at both levels (LL), eliminates the parameter interactions responsible for slow convergence under Softmax gating. The paper establishes density-estimation rates in Hellinger distance and derives bounds based on Voronoi loss functions that characterize the convergence of parameter and expert estimation; these bounds indicate that LL gating yields the fastest expert convergence and the strongest expert specialization. Experiments on multimodal and multi-domain tasks, including MIMIC-based clinical fusion and CMU-MOSI sentiment analysis, show that Laplace-based gating consistently improves performance and routing diversity, with LL often achieving the best results. The work offers theoretical and empirical guidance for choosing gating functions in HMoE, highlights practical gains on complex, heterogeneous data, and outlines future directions in model order estimation and the study of other gating functions.
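To make the Hellinger distance mentioned above concrete, the following minimal sketch computes it in closed form for two univariate Gaussian densities. The function name `hellinger_gaussians` and the normalization $h^2 = 1 - \int \sqrt{pq}$ (so that $h \in [0,1]$) are assumptions made for illustration; the paper's own convention may differ by a constant factor.

```python
import math


def hellinger_gaussians(mu1, sigma1, mu2, sigma2):
    """Hellinger distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    Assumed convention: h^2 = 1 - BC, where BC = integral of sqrt(p * q)
    is the Bhattacharyya coefficient, so the result lies in [0, 1].
    """
    var_sum = sigma1 ** 2 + sigma2 ** 2
    # Closed-form Bhattacharyya coefficient for two univariate Gaussians.
    bc = math.sqrt(2.0 * sigma1 * sigma2 / var_sum) * math.exp(
        -((mu1 - mu2) ** 2) / (4.0 * var_sum)
    )
    # Clamp to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - bc))


# Identical densities are at distance 0; well-separated ones approach 1.
print(hellinger_gaussians(0.0, 1.0, 0.0, 1.0))
print(hellinger_gaussians(0.0, 1.0, 3.0, 1.0))
```

A density-estimation rate in this metric bounds how quickly the distance between the fitted and true conditional densities shrinks as the sample size grows.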

Abstract

With the growing prominence of the Mixture of Experts (MoE) architecture in developing large-scale foundation models, we investigate the Hierarchical Mixture of Experts (HMoE), a specialized variant of MoE that excels in handling complex inputs and improving performance on targeted tasks. Our analysis highlights the advantages of using the Laplace gating function over the traditional Softmax gating within the HMoE frameworks. We theoretically demonstrate that applying the Laplace gating function at both levels of the HMoE model helps eliminate undesirable parameter interactions caused by the Softmax gating and, therefore, accelerates the expert convergence as well as enhances the expert specialization. Empirical validation across diverse scenarios supports these theoretical claims. This includes large-scale multimodal tasks, image classification, and latent domain discovery and prediction tasks, where our modified HMoE models show great performance improvements compared to the conventional HMoE models.
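The difference between the two gating functions can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: it assumes that Softmax gating scores expert $i$ by the affine form $\boldsymbol{a}_i^\top\boldsymbol{x}+b_i$, that Laplace gating scores it by $-\|\boldsymbol{a}_i-\boldsymbol{x}\|_2+b_i$, and that in the SL configuration Softmax is used at the first level and Laplace at the second. All function names and parameter values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable normalization of scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_gate(x, weights, biases):
    """Softmax gating (assumed form): score_i = a_i . x + b_i."""
    return softmax(
        [sum(a * xi for a, xi in zip(w, x)) + b for w, b in zip(weights, biases)]
    )


def laplace_gate(x, weights, biases):
    """Laplace gating (assumed form): score_i = -||a_i - x||_2 + b_i."""
    return softmax([-math.dist(w, x) + b for w, b in zip(weights, biases)])


def hmoe_weights(x, level1, level2, gate1, gate2):
    """Return weights[i1][i2] = P(i1 | x) * P(i2 | i1, x) for a two-level HMoE."""
    outer = gate1(x, *level1)
    return [
        [p1 * p2 for p2 in gate2(x, *level2[i1])] for i1, p1 in enumerate(outer)
    ]


# Hypothetical parameters: 2 first-level groups, 3 experts in each group.
x = [1.0, -0.5]
level1 = ([[0.5, 0.0], [-0.5, 1.0]], [0.0, 0.1])
level2 = [
    ([[1.0, 1.0], [0.0, -1.0], [2.0, 0.0]], [0.0, 0.0, 0.0]),
    ([[-1.0, 0.0], [0.5, 0.5], [1.0, -1.0]], [0.2, 0.0, -0.2]),
]
configs = {
    "SS": (softmax_gate, softmax_gate),
    "SL": (softmax_gate, laplace_gate),
    "LL": (laplace_gate, laplace_gate),
}
for name, (g1, g2) in configs.items():
    w = hmoe_weights(x, level1, level2, g1, g2)
    print(name, [[round(v, 3) for v in row] for row in w])
```

In each configuration the six products form a probability distribution over (group, expert) pairs; only the way the scores depend on $\boldsymbol{x}$ changes between Softmax and Laplace gating.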

Paper Structure (33 sections, 7 theorems, 167 equations, 7 figures, 6 tables, 1 algorithm)

Introduction
Preliminaries
Problem Setup
Density Estimation
Convergence Rates of Parameter Estimation and Expert Estimation
Softmax-Softmax Gating Gaussian HMoE
Softmax-Laplace Gating Gaussian HMoE
Laplace-Laplace Gating Gaussian HMoE
Summary of Main Theoretical Findings
Experiments
Comparison of Different Hierarchical Gating Mechanisms
Laplace Gating Mechanism Improves Multimodal Fusion
The MIMIC Ecosystem
CMU-MOSI Dataset
HMoE Naturally Capture Hierarchical Structures in the Data
...and 18 more sections

Key Result

Proposition 1

For each $\mathrm{type}\in\{SS,SL,LL\}$, suppose that the equation $p^{\mathrm{type}}_{G}(y|\boldsymbol{x})=p^{\mathrm{type}}_{G_*}(y|\boldsymbol{x})$ holds for almost every $(\boldsymbol{x},y)$. Then $G\equiv G_*$.

Figures (7)

Figure 1: Comparison of HMoE and standard MoE in managing multimodal input: MoE excels at processing homogeneous inputs. However, it faces challenges with more intricate structures, such as inputs that can be split into subgroups or those with inherently hierarchical configurations. By contrast, HMoE improves upon this by decomposing tasks into subproblems and directing subsets of data to specialized groups of experts. This approach allows for more granular specialization and enhances the model's capability to handle complex inputs.
Figure 2: Illustration of the first-level and second-level Voronoi cells. In the first level, Voronoi cells $\mathcal{V}_{j_1}$, for $j_1\in[k_1^*]$, are generated by the ground-truth first-level parameters $\boldsymbol{a}^*_{j_1}$ (red squares) and contain the first-level fitted parameters $\boldsymbol{a}_{i_1}$ (blue stars). Since the value of $k_1^*$ is known, the first level is exactly specified, so each Voronoi cell $\mathcal{V}_{j_1}$ contains exactly one blue star. In the second level, each gray rectangle depicts a set of $k_2^*=3$ Voronoi cells $\{\mathcal{V}_{j_2|j_1}:j_2\in[k_2^*]\}$ generated by the ground-truth second-level parameters $\boldsymbol{\zeta}^*_{j_2|j_1}$ (red triangles), for $j_1\in[k_1^*]$. These three Voronoi cells $\mathcal{V}_{j_2|j_1}$ contain a total of $k_2=5$ second-level fitted parameters $\boldsymbol{\zeta}_{i_2|j_1}$ (blue circles). Since $k_2>k_2^*$, at least one Voronoi cell $\mathcal{V}_{j_2|j_1}$ contains more than one blue circle.
Figure 3: We evaluate the impact of using different gating function combinations in HMoE and compare it with standard MoE on (a) CIFAR-10, (b) ImageNet, and (c) CIFAR-10-Corrupted. First, we present the results of one-layer MoE models (left side of each figure), where the model contains only the module of that specific setting. For the one-layer results, we use Tiny-ImageNet as a substitute for the full ImageNet. Next, we integrate these MoE modules into the state-of-the-art Vision MoE model (Riquelme et al., 2021) (right side of each figure) and compare the performance on the full datasets.
Figure 4: Synthetic experiment illustrating how HMoE more effectively handles data with multi-level structures. Figures (a) and (b) depict the hierarchical target generation process, and (c) shows HMoE’s predictive advantage over MoE.
Figure 5: Synthetic experiment illustrating how HMoE more effectively handles data with multi-level structures. Figures (d)–(f) highlight how HMoE’s coarse-to-fine partitioning of the input space results in stronger expert specialization.
...and 2 more figures
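
The Voronoi cells described in the Figure 2 caption can be sketched as a nearest-neighbor assignment: each fitted parameter is placed in the cell of the ground-truth parameter closest to it. The sketch below assumes Euclidean distance and uses hypothetical 2-D points that mirror the caption's second-level setting ($k_2^*=3$ ground-truth parameters, $k_2=5$ fitted parameters); the function name `voronoi_cells` is illustrative, not from the paper.

```python
import math


def voronoi_cells(fitted, truth):
    """Map each ground-truth index j to the indices of fitted points nearest to it.

    Ties are broken in favor of the smallest ground-truth index.
    """
    cells = {j: [] for j in range(len(truth))}
    for i, point in enumerate(fitted):
        nearest = min(range(len(truth)), key=lambda j: math.dist(point, truth[j]))
        cells[nearest].append(i)
    return cells


# Hypothetical over-specified second level: 3 true parameters, 5 fitted ones.
truth = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
fitted = [(0.1, 0.1), (-0.2, 0.1), (3.9, 0.2), (4.1, -0.1), (0.2, 3.8)]

cells = voronoi_cells(fitted, truth)
print(cells)
```

Because there are more fitted parameters than ground-truth ones, at least one cell must hold several fitted parameters; the cardinality of each cell is what distinguishes exactly fitted from over-fitted components in Voronoi-type losses.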

Theorems & Definitions (9)

Proposition 1
Proposition 2
Lemma 1
Theorem 1
Theorem 2
Theorem 3
Proof of the density estimation proposition
Lemma 2 (Theorem 7.4, van de Geer, 2000)
Proof of Proposition 1 (identifiability)
