DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks

Gökdeniz Gülmez

TL;DR

DynaMoE is a Mixture-of-Experts (MoE) framework that relaxes two rigid assumptions of standard MoE designs, fixed Top-K routing and uniform expert allocation across layers, through dynamic token-level expert activation and layer-wise adaptive capacity allocation. The paper presents it as a framework for adaptive computation that offers principled guidance for MoE architecture design.

Abstract

Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling neural networks while maintaining computational efficiency. However, standard MoE implementations rely on two rigid design assumptions: (1) fixed Top-K routing, where exactly K experts are activated per token, and (2) uniform expert allocation across all layers. This paper introduces DynaMoE, a novel MoE framework that relaxes both constraints through dynamic token-level expert activation and layer-wise adaptive capacity allocation. DynaMoE introduces a principled routing mechanism in which the number of active experts per token varies with input complexity. Concurrently, the framework implements six distinct scheduling strategies for distributing expert capacity across network depth, including descending, ascending, pyramid, and wave patterns. We theoretically analyze the expressivity gains of dynamic routing and derive bounds on computational efficiency. Through extensive experiments on MNIST, Fashion-MNIST, CIFAR-10 (image classification), and Recycling-the-Web (language modeling) across multiple model scales, we demonstrate that DynaMoE achieves superior parameter efficiency compared to static baselines. Our key finding is that optimal expert schedules are task- and scale-dependent. On image classification, descending schedules (which concentrate capacity in early layers) outperform uniform baselines. For language modeling, the optimal schedule varies by model size: descending for Tiny, ascending for Small, and uniform for Medium. Furthermore, dynamic routing reduces gradient variance during training, leading to improved convergence stability. DynaMoE establishes a new framework for adaptive computation in neural networks, providing principled guidance for MoE architecture design.
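
The abstract does not spell out the routing rule, so the following is only a minimal sketch of how a variable number of experts per token could be selected. It assumes, purely for illustration, that an expert is active for a token when its softmax router probability reaches the `tau`-quantile of all router probabilities in the batch, that at least one expert is always kept, and that the count is capped at `k_max = ceil((1 - tau) * N)`. The function name `dynamic_route` and the quantile rule are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_route(batch_logits, tau):
    """Pick a variable number of experts per token (illustrative only).

    ASSUMPTION: the activation threshold is the tau-quantile of all router
    probabilities in the batch; the paper's actual rule may differ.
    Returns, per token, a list of (expert_index, renormalized_weight).
    """
    n_experts = len(batch_logits[0])
    # round() guards against float noise such as (1 - 0.7) * 10 = 3.0000000000000004
    k_max = math.ceil(round((1.0 - tau) * n_experts, 9))
    probs = [softmax(row) for row in batch_logits]

    # Nearest-rank quantile over every (token, expert) probability.
    flat = sorted(p for row in probs for p in row)
    threshold = flat[min(len(flat) - 1, int(tau * len(flat)))]

    routes = []
    for row in probs:
        ranked = sorted(range(n_experts), key=lambda i: row[i], reverse=True)
        chosen = [i for i in ranked if row[i] >= threshold][:k_max]
        if not chosen:
            chosen = ranked[:1]  # always keep at least one expert
        weight_sum = sum(row[i] for i in chosen)
        routes.append([(i, row[i] / weight_sum) for i in chosen])
    return routes


# A confident token (peaked logits) and an ambiguous token (flat logits).
routes = dynamic_route([[4.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]], tau=0.5)
for token_id, route in enumerate(routes):
    print(token_id, [expert for expert, _ in route])
```

Under these assumptions the peaked token activates a single expert while the flat token activates two, which is the qualitative behavior the abstract describes: harder or more ambiguous inputs receive more expert capacity.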

Paper Structure (72 sections, 5 theorems, 31 equations, 11 figures, 7 tables, 1 algorithm)

Key Result

Theorem 1

Let $\mathcal{A}_K$ denote the set of distinct expert activation patterns under fixed Top-$K$ routing with $N$ experts, and $\mathcal{A}_\tau$ the corresponding set under DynaMoE with percentile threshold $\tau$ and $K_{\max} = \lceil (1-\tau)N \rceil \geq K$. Then $|\mathcal{A}_\tau| \geq |\mathcal{A}_K|$, with strict inequality whenever $K_{\max} > K$. When $K_{\max} > K$, the ratio $|\mathcal{A}_\tau| / |\mathcal{A}_K|$ satisfies a further bound whose displayed expression is not reproduced here.
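
To make the counting behind Theorem 1 concrete, the sketch below compares the number of activation patterns under two simple models. It assumes that fixed Top-$K$ routing can reach every subset of exactly $K$ experts, and that dynamic routing can reach every non-empty subset of at most $K_{\max}$ experts. This counting model is an assumption made for illustration; the paper's precise definition of $\mathcal{A}_\tau$ may differ, and the helper names are hypothetical.

```python
import math


def fixed_topk_patterns(n_experts, k):
    """Number of expert subsets of size exactly k."""
    return math.comb(n_experts, k)


def dynamic_patterns(n_experts, tau):
    """Number of non-empty expert subsets of size at most K_max.

    ASSUMPTION: every subset size from 1 to K_max is reachable.
    K_max = ceil((1 - tau) * N) as in the theorem statement.
    """
    # round() guards against float noise in (1 - tau) * N
    k_max = math.ceil(round((1.0 - tau) * n_experts, 9))
    return sum(math.comb(n_experts, k) for k in range(1, k_max + 1))


n_experts, k, tau = 8, 2, 0.5            # K_max = 4 >= K = 2
fixed = fixed_topk_patterns(n_experts, k)
dynamic = dynamic_patterns(n_experts, tau)
print(fixed, dynamic, dynamic / fixed)   # 28 162 and a ratio of about 5.79
```

With 8 experts, Top-2 routing yields 28 patterns, while the dynamic model with $\tau = 0.5$ yields 162, since it also admits subsets of size 1, 3, and 4.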

Figures (11)

  • Figure 1: Comparison of expert scheduling strategies showing how expert capacity is distributed across 12 network layers. Descending concentrates capacity in early layers, ascending in later layers, while pyramid and uniform strategies show intermediate patterns. $N_{\max}=16$, $N_{\min}=1$.
  • Figure 2: DynaMoE architecture with descending expert schedule. Expert capacity decreases from 8 experts in Layer 1 to 1 expert in Layer 4, concentrating computational resources in early feature extraction layers.
  • Figure 3: Expert activation probability heatmaps across 6 layers for three scheduling strategies. Brighter colors indicate higher activation probability. Descending schedule shows strong activation in early layers (L1-L2), uniform maintains consistent patterns, while ascending concentrates activation in deeper layers.
  • Figure 4: Performance comparison across model sizes (Tiny, Small, Medium) for language modeling. Left: Best validation perplexity (lower is better). Right: Best validation accuracy (higher is better). Results show the descending schedule achieving the best validation perplexity across all model sizes, consistent with image classification findings.
  • Figure 5: Scaling analysis on MNIST across model sizes
  • ...and 6 more figures
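
The scheduling strategies in Figures 1 and 2 can be sketched as a function from layer index to expert count. The version below is a hedged illustration: it assumes linear interpolation between `n_min` and `n_max` with rounding to the nearest integer, and it assumes the uniform schedule places `n_max` experts in every layer. Neither choice is confirmed by the captions. The wave pattern and the sixth strategy are omitted because their shapes are not specified here, and the name `expert_schedule` is hypothetical.

```python
def expert_schedule(kind, n_layers, n_max, n_min=1):
    """Return the number of experts for each layer, first layer first.

    ASSUMPTION: schedules interpolate linearly between n_min and n_max;
    the paper's exact formulas may differ.
    """
    if n_layers < 1:
        raise ValueError("n_layers must be at least 1")
    counts = []
    for layer in range(n_layers):
        # Relative depth in [0, 1]; a single-layer network sits at depth 0.
        t = layer / (n_layers - 1) if n_layers > 1 else 0.0
        if kind == "uniform":
            frac = 1.0                        # same capacity everywhere
        elif kind == "descending":
            frac = 1.0 - t                    # most capacity in early layers
        elif kind == "ascending":
            frac = t                          # most capacity in late layers
        elif kind == "pyramid":
            frac = 1.0 - abs(2.0 * t - 1.0)   # peak in the middle layers
        else:
            raise ValueError(f"unknown schedule: {kind}")
        counts.append(n_min + round(frac * (n_max - n_min)))
    return counts


# Four layers going from 8 experts down to 1, the endpoints stated for Figure 2.
print(expert_schedule("descending", n_layers=4, n_max=8))  # [8, 6, 3, 1]
print(expert_schedule("ascending", n_layers=4, n_max=8))   # [1, 3, 6, 8]
```

The endpoints of the descending example match the Figure 2 caption (8 experts in Layer 1, 1 expert in Layer 4); the intermediate values depend on the interpolation assumed here.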

Theorems & Definitions (10)

  • Definition 1: Dynamic Expert Selection
  • Definition 2: Expert Schedule
  • Theorem 1: Routing Diversity Gain
  • Proof: Proof Sketch
  • Proposition 1: Expected Computation
  • Theorem 2: Gradient Variance Bound
  • Definition 3: Scheduling Optimization
  • Proposition 2: Descending Optimality
  • Proposition 3: Curvature-Depth Monotonicity
  • Proof