Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts

Fangshuo Liao, Anastasios Kyrillidis

TL;DR

This paper develops a theoretical framework for jointly training soft-routed MoEs with non-linear experts in a teacher–student setting, using gradient flow on Gaussian inputs. By leveraging a Hermite expansion of the gating and activation functions, the authors prove a sequential feature-learning regime in which student router/expert pairs align with their teacher counterparts on a time scale of O(√d). This phase is followed by pruning of redundant experts and a provably convergent fine-tuning stage that reaches zero loss. The work shows that, under moderate over-parameterization, the MoE optimization landscape admits a principled recovery order and a post-hoc pruning strategy that preserves learned components while removing redundant units. This connects MoE architectural practice to theoretical guarantees, with implications for scaling and pruning in large models. The results rely on Gaussian data and a cubic Hermite activation, but the techniques may extend to broader data distributions and routing schemes.

Abstract

Mixture-of-Experts (MoE) architectures have emerged as a cornerstone of modern AI systems. In particular, MoEs route inputs dynamically to specialized experts whose outputs are aggregated through weighted summation. Despite their widespread application, theoretical understanding of MoE training dynamics remains limited to either separate expert-router optimization or only top-1 routing scenarios with carefully constructed datasets. This paper advances MoE theory by providing convergence guarantees for joint training of soft-routed MoE models with non-linear routers and experts in a student-teacher framework. We prove that, with moderate over-parameterization, the student network undergoes a feature learning phase, where the router's learning process is "guided" by the experts, that recovers the teacher's parameters. Moreover, we show that post-training pruning can effectively eliminate redundant neurons, followed by a provably convergent fine-tuning process that reaches global optimality. To our knowledge, our analysis is the first to bring novel insights into the optimization landscape of the MoE architecture.
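
To make the setting in the abstract concrete, the following is a minimal sketch of a soft-routed MoE in a teacher-student setup on Gaussian inputs. It is not the paper's exact model: the sigmoid gate, the choice of He_3(z) = z^3 - 3z as the "cubic Hermite" activation, the loss normalization, and all function names (`soft_moe`, `mc_loss`, and so on) are assumptions made for illustration.

```python
import math
import random

def he3(z):
    # Third probabilists' Hermite polynomial, He_3(z) = z^3 - 3z.
    # Assumed here as the cubic Hermite expert activation.
    return z ** 3 - 3.0 * z

def sigmoid(z):
    # Assumed gating function; the paper's gate may differ.
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def soft_moe(routers, experts, x, gate=sigmoid, act=he3):
    # Soft routing: every expert contributes, weighted by its gate value.
    # Output is sum_i gate(v_i . x) * act(w_i . x).
    return sum(gate(dot(v, x)) * act(dot(w, x))
               for v, w in zip(routers, experts))

def unit_vector(d, rng):
    # Random direction on the unit sphere in R^d.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    return [gi / norm for gi in g]

def mc_loss(student, teacher, d, n_samples, rng):
    # Monte Carlo estimate of the mean squared error over x ~ N(0, I_d).
    # student and teacher are (routers, experts) pairs.
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        diff = soft_moe(*student, x) - soft_moe(*teacher, x)
        total += diff * diff
    return total / n_samples

if __name__ == "__main__":
    rng = random.Random(0)
    d, m_star, m = 20, 2, 6  # small illustrative sizes, with m > m_star
    teacher = ([unit_vector(d, rng) for _ in range(m_star)],
               [unit_vector(d, rng) for _ in range(m_star)])
    student = ([unit_vector(d, rng) for _ in range(m)],
               [unit_vector(d, rng) for _ in range(m)])
    print("loss at random init:", mc_loss(student, teacher, d, 2000, rng))
```

The student has more router/expert pairs than the teacher (`m > m_star`), which mirrors the over-parameterized regime the abstract describes. Training would minimize `mc_loss` over the student's parameters.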

Paper Structure

This paper contains 24 sections, 35 theorems, 439 equations, 3 figures.

Key Result

Theorem 1

Consider training the MoE model $f\left(\bm{\theta},{\mathbf{x}}\right)$ in (eq:moe) with respect to a teacher model given by (eq:teacher) using the gradient flow on the population MSE loss in (eq:population_loss). Let $\delta_{\mathbb{P}}\in (0, 1/7)$ be given. If $m \geq \Omega\left(m^\star\log \f Moreover, for $T^\star \leq t \leq T^\star + \mathcal{O}\left(\frac{\delta_{\mathbb{P}}\sqrt{d}}{m^

Figures (3)

  • Figure 1: Training MoE in (eq:moe) on (eq:population_loss) with $m^\star = 5, m = 25$, and $d= 1000$ with online batch SGD simulating GF on the population loss. Left: alignment values of the router parameters $\bar{{\mathbf{v}}}_i^\top\bar{{\mathbf{v}}}_j^\star$. Right: alignment values of the expert parameters $\bar{{\mathbf{w}}}_i^\top\bar{{\mathbf{w}}}_j^\star$.
  • Figure 2: Dynamics of the routers' and experts' alignment values with the teacher's parameters under the same set-up as Figure 1. The green curve denotes the loss value. For all other curves, a dashed line and a solid line of the same color denote the alignment values of a paired router and expert.
  • Figure 3: Plot of the property of $\pi(x)$
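
The alignment values in Figures 1 and 2 are inner products between normalized student and teacher parameters, such as $\bar{{\mathbf{v}}}_i^\top\bar{{\mathbf{v}}}_j^\star$. A minimal sketch of how such an alignment matrix could be computed follows. The helper names `alignment_matrix` and `best_match` are hypothetical, and the matching rule (largest absolute alignment per teacher unit) is an assumption for illustration, not the paper's procedure.

```python
import math

def normalize(v):
    # Scale v to unit Euclidean norm; assumes v is not the zero vector.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def alignment_matrix(student, teacher):
    # Entry [i][j] is the inner product of the normalized i-th student
    # vector and the normalized j-th teacher vector (cosine similarity).
    s = [normalize(v) for v in student]
    t = [normalize(v) for v in teacher]
    return [[sum(a * b for a, b in zip(si, tj)) for tj in t] for si in s]

def best_match(align):
    # For each teacher index j, return the student index i with the
    # largest absolute alignment. Illustrative rule only.
    n_teacher = len(align[0])
    return [max(range(len(align)), key=lambda i: abs(align[i][j]))
            for j in range(n_teacher)]

if __name__ == "__main__":
    student = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
    teacher = [[1.0, 0.0], [0.0, 1.0]]
    A = alignment_matrix(student, teacher)
    for row in A:
        print([round(a, 3) for a in row])
    print("matched student index per teacher unit:", best_match(A))
```

An entry close to 1 in absolute value indicates that a student vector has recovered the corresponding teacher direction. Rows with no large entry correspond to student units that did not align with any teacher unit.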

Theorems & Definitions (68)

  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Theorem 2
  • Theorem 3
  • Definition 1: Error Bounds
  • Definition 2: Recovery Time
  • Lemma 3
  • proof
  • Lemma 4
  • ...and 58 more