Excitation: Momentum For Experts

Sagi Shaier

TL;DR

Excitation is optimizer-, domain-, and model-agnostic, requires minimal integration effort, and introduces neither additional per-parameter optimizer state nor learnable parameters, making it highly viable for memory-constrained settings.

Abstract

We propose Excitation, a novel optimization framework designed to accelerate learning in sparse architectures such as Mixture-of-Experts (MoEs). Unlike traditional optimizers that treat all parameters uniformly, Excitation dynamically modulates updates using batch-level expert utilization. It introduces a competitive update dynamic that amplifies updates to highly-utilized experts and can selectively suppress low-utilization ones, effectively sharpening routing specialization. Notably, we identify a phenomenon of "structural confusion" in deep MoEs, where standard optimizers fail to establish functional signal paths; Excitation acts as a specialization catalyst, "rescuing" these models and enabling stable training where baselines remain trapped. Excitation is optimizer-, domain-, and model-agnostic, requires minimal integration effort, and introduces neither additional per-parameter optimizer state nor learnable parameters, making it highly viable for memory-constrained settings. Across language and vision tasks, Excitation consistently improves convergence speed and final performance in MoE models, indicating that active update modulation is a key mechanism for effective conditional computation.
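The competitive update dynamic described in the abstract can be illustrated with a minimal sketch. The exact Excitation update rule is not given in this section, so everything below is an assumption made for illustration: the names `utilization_multipliers` and `excited_sgd_step` are hypothetical, the comparison against the uniform share `1 / num_experts` is a guessed threshold, and the 1.5×/0.5× values are borrowed from the ZeroSum setting mentioned in the Figure 1 caption.

```python
from collections import Counter


def utilization_multipliers(expert_ids, num_experts, boost=1.5, damp=0.5):
    """Map batch-level expert utilization to per-expert update multipliers.

    Hypothetical rule (not taken from the paper): an expert whose share of
    the batch exceeds the uniform share 1/num_experts gets `boost`,
    every other expert gets `damp`.
    """
    if not expert_ids:
        # No routing information in this batch: leave all updates unchanged.
        return [1.0] * num_experts
    counts = Counter(expert_ids)
    total = len(expert_ids)
    uniform_share = 1.0 / num_experts
    return [
        boost if counts.get(e, 0) / total > uniform_share else damp
        for e in range(num_experts)
    ]


def excited_sgd_step(weights, grads, multipliers, lr=0.1):
    """Plain SGD step where each expert's update is scaled by its multiplier.

    `weights` and `grads` are lists with one list of floats per expert.
    """
    return [
        [w - lr * m * g for w, g in zip(expert_w, expert_g)]
        for expert_w, expert_g, m in zip(weights, grads, multipliers)
    ]


# Example: 3 of 4 tokens in the batch are routed to expert 0.
multipliers = utilization_multipliers([0, 0, 0, 1], num_experts=2)
new_weights = excited_sgd_step([[1.0], [1.0]], [[1.0], [1.0]], multipliers)
```

Because the multipliers depend only on routing counts from the current batch, this sketch keeps no extra per-parameter state and has no learnable parameters, which matches the memory claim in the abstract. The same scaling could in principle be applied to the update produced by any base optimizer.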

Paper Structure

This paper contains 31 sections, 5 equations, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: Overview of Excitation. Top: Batch-level expert activations drive competitive update modulation; three strategies are available (here: ZeroSum with 1.5×/0.5× multipliers). Bottom: Optimization trajectory on a 2D toy problem using ZeroSum. Excited SGD (pink) converges faster by amplifying high-consensus expert updates (Expert 1, $w_0$) while suppressing low-utilization ones (Expert 2, $w_1$), outperforming standard SGD (gray) and SGD+Momentum (cyan).
  • Figure 2: CIFAR-10 Foundation Results. Targeted mechanisms ($\Phi_{ZS}, \Phi_{PS}$) significantly increase accuracy. The failure of Inverted, Random, and Global-Exp variants demonstrates that success is driven by spatial reinforcement of expert specialization rather than simple learning rate scaling.
  • Figure 3: Rescuing deep sparse models from training collapse. Standard optimizers remain trapped at near-random accuracy due to poor routing in deep layers. Excitation overcomes this bottleneck, enabling both SGD and Adam to "escape" the random-guessing regime and initiate meaningful specialization.
  • Figure 4: Routing Specialization by Depth. Gini coefficients of expert utilization. While standard optimizers exhibit a specialization "dip" in intermediate layers, Excitation maintains high selectivity and sharper routing across all network depths.
  • Figure 5: Evolution of Routing Entropy. Comparison of mean entropy across training. Excitation exhibits faster decay and a lower entropy floor, demonstrating that update modulation enables earlier convergence to highly specialized states.
  • ...and 6 more figures
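
The captions for Figures 4 and 5 refer to two routing diagnostics: the Gini coefficient of expert utilization and the mean routing entropy. The paper's exact definitions are not shown in this section, so the sketch below uses the standard textbook formulas as an assumption. The function names `gini` and `mean_routing_entropy` are hypothetical, and entropy is computed in nats over each token's router probabilities.

```python
import math


def gini(utilization):
    """Gini coefficient of a non-negative utilization vector.

    0 means perfectly even use of all experts; values near (n - 1) / n
    mean almost all load falls on a single expert.
    """
    xs = sorted(utilization)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard form: G = sum_i (2i - n - 1) * x_i / (n * sum(x)),
    # with x sorted ascending and i running from 1 to n.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


def mean_routing_entropy(router_probs):
    """Mean Shannon entropy (in nats) of per-token routing distributions.

    `router_probs` is a list of rows, one per token, each summing to 1.
    Lower values indicate sharper, more specialized routing.
    """
    entropies = [
        -sum(p * math.log(p) for p in row if p > 0.0)
        for row in router_probs
    ]
    return sum(entropies) / len(entropies)


# Example: four experts, all load on one expert versus an even split.
concentrated = gini([0, 0, 0, 4])
even = gini([1, 1, 1, 1])
```

Under these definitions, sharper routing shows up as a higher Gini coefficient and a lower mean entropy, which is the direction the captions describe for Excitation.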