Excitation: Momentum For Experts

Sagi Shaier

TL;DR

Excitation is optimizer-, domain-, and model-agnostic, requires minimal integration effort, and introduces neither additional per-parameter optimizer state nor learnable parameters, making it highly viable for memory-constrained settings.

Abstract

We propose Excitation, a novel optimization framework designed to accelerate learning in sparse architectures such as Mixture-of-Experts (MoE) models. Unlike traditional optimizers that treat all parameters uniformly, Excitation dynamically modulates updates using batch-level expert utilization. It introduces a competitive update dynamic that amplifies updates to highly utilized experts and can selectively suppress low-utilization ones, effectively sharpening routing specialization. Notably, we identify a phenomenon of "structural confusion" in deep MoEs, where standard optimizers fail to establish functional signal paths; Excitation acts as a specialization catalyst, "rescuing" these models and enabling stable training where baselines remain trapped. Excitation is optimizer-, domain-, and model-agnostic, requires minimal integration effort, and introduces neither additional per-parameter optimizer state nor learnable parameters, making it highly viable for memory-constrained settings. Across language and vision tasks, Excitation consistently improves convergence speed and final performance in MoE models, indicating that active update modulation is a key mechanism for effective conditional computation.
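To make the idea of utilization-driven update modulation concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes a hypothetical thresholding rule (experts at or above the batch's mean utilization are boosted, the rest are damped) and borrows the 1.5×/0.5× multipliers mentioned in the Figure 1 caption; the function names `expert_utilization`, `update_multipliers`, and `excited_sgd_step` are illustrative and do not come from the source. The actual excitation functions are defined in the paper body and may differ.

```python
from collections import Counter


def expert_utilization(routed_expert_ids, num_experts):
    """Fraction of the batch's tokens routed to each expert."""
    counts = Counter(routed_expert_ids)
    total = len(routed_expert_ids)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def update_multipliers(utilization, boost=1.5, damp=0.5):
    """Assumed rule: boost experts at or above mean utilization, damp the rest."""
    mean_util = sum(utilization) / len(utilization)
    return [boost if u >= mean_util else damp for u in utilization]


def excited_sgd_step(expert_weights, expert_grads, multipliers, lr=0.1):
    """Plain SGD step with each expert's update scaled by one scalar multiplier."""
    return [
        [w - lr * m * g for w, g in zip(ws, gs)]
        for ws, gs, m in zip(expert_weights, expert_grads, multipliers)
    ]


# Toy batch: 8 tokens, 2 experts; expert 0 receives 6 tokens, expert 1 receives 2.
routed = [0, 0, 0, 1, 0, 0, 1, 0]
util = expert_utilization(routed, num_experts=2)
mult = update_multipliers(util)

# One weight per expert, identical gradients, so only the multiplier differs.
weights = [[1.0], [1.0]]
grads = [[1.0], [1.0]]
new_weights = excited_sgd_step(weights, grads, mult, lr=0.1)
```

In this toy batch, expert 0 takes a 1.5× step and expert 1 a 0.5× step even though their gradients are identical. The multipliers are recomputed from each batch and stored per expert rather than per parameter, which is in keeping with the abstract's claim that no additional per-parameter optimizer state is required.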

Paper Structure (31 sections, 5 equations, 11 figures, 8 tables)

Introduction
Related Work
The Excitation Framework
Activation-Aware Modulation
Optimizer-Agnostic Implementation
Excitation Functions
Core Formulations
Control and Ablation Variants
Experiments
Foundational Benchmark: Sparse Structural Convergence
Isolating the Targeting Effect
Generalization Across Optimizers
Rescuing Deep Networks
Expert Specialization and Confidence
Specialization and Routing Dynamics
...and 16 more sections

Figures (11)

Figure 1: Overview of Excitation. Top: Batch-level expert activations drive competitive update modulation; three strategies are available (here: ZeroSum with 1.5×/0.5× multipliers). Bottom: Optimization trajectory on a 2D toy problem using ZeroSum. Excited SGD (pink) converges faster by amplifying high-consensus expert updates (Expert 1, $w_0$) while suppressing low-utilization ones (Expert 2, $w_1$), outperforming standard SGD (gray) and SGD+Momentum (cyan).
Figure 2: CIFAR-10 Foundation Results. Targeted mechanisms ($\Phi_{ZS}, \Phi_{PS}$) significantly increase accuracy. The failure of Inverted, Random, and Global-Exp variants demonstrates that success is driven by spatial reinforcement of expert specialization rather than simple learning rate scaling.
Figure 3: Rescuing deep sparse models from training collapse. Standard optimizers remain trapped at near-random accuracy due to poor routing in deep layers. Excitation overcomes this bottleneck, enabling both SGD and Adam to "escape" the random-guessing regime and initiate meaningful specialization.
Figure 4: Routing Specialization by Depth. Gini coefficients of expert utilization. While standard optimizers exhibit a specialization "dip" in intermediate layers, Excitation maintains high selectivity and sharper routing across all network depths.
Figure 5: Evolution of Routing Entropy. Comparison of mean entropy across training. Excitation exhibits faster decay and a lower entropy floor, demonstrating that update modulation enables earlier convergence to highly specialized states.
...and 6 more figures