S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning
Giang Do, Hung Le, Truyen Tran
TL;DR
Sparse Mixture of Experts (SMoE) enables efficient model training but suffers from representation collapse, in part because expert embeddings are much smaller than the model dimension. S2MoE introduces a Gaussian Noise Module and a gating mechanism to learn from both deterministic and noise-augmented inputs. It is trained with a task loss plus balancing and uncertainty (InfoNCE) losses to diversify expert representations. The approach achieves accuracy comparable to state-of-the-art routing methods while reducing inference costs by up to 28% by activating fewer experts, and it shows strong pre-training and fine-tuning performance across multiple NLP tasks. This work advances practical deployment of SMoEs in large language models by enhancing feature learning and mitigating collapse without increasing inference complexity.
Abstract
Sparse Mixture of Experts (SMoE) enables efficient training of large language models by routing each input token to a small subset of experts. However, training SMoE remains challenging due to representation collapse. Recent studies have focused on improving the router to mitigate this problem, but existing approaches face two key limitations: (1) the dimension of expert embeddings is significantly smaller than the model's dimension, which contributes to representation collapse, and (2) routing each input to the Top-K experts can cause those experts to learn overly similar features. In this work, we propose a novel approach called Robust Sparse Mixture of Experts via Stochastic Learning (S2MoE), a mixture of experts designed to learn from both deterministic and non-deterministic inputs via Learning under Uncertainty. Extensive experiments across various tasks demonstrate that S2MoE achieves performance comparable to other routing methods while reducing inference computation costs by 28%.
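To make the routing idea concrete, here is a minimal pure-Python sketch of Top-K routing combined with a noise-augmented branch. It is an illustration under stated assumptions, not the authors' implementation. All function names (`top_k_route`, `add_gaussian_noise`, `moe_forward`, `s2moe_like_forward`) are hypothetical, and so are the fixed noise scale `sigma` and the scalar `gate`. The convex mix of the two branches is one plausible reading of "learning from both deterministic and non-deterministic inputs". The training losses (task, balancing, InfoNCE) are omitted.

```python
import math
import random


def top_k_route(logits, k):
    """Return {expert_index: weight}: a softmax over the k largest router logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def add_gaussian_noise(x, sigma, rng):
    """Assumed noise module: add zero-mean Gaussian noise to each component."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def moe_forward(x, router, experts, k):
    """Route x to its Top-K experts and return the weighted sum of their outputs."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in router]
    out = [0.0] * len(x)
    for i, weight in top_k_route(logits, k).items():
        y = experts[i](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def s2moe_like_forward(x, router, experts, k, sigma, gate, rng):
    """Assumed combination: convex mix of the clean and noise-augmented branches."""
    clean = moe_forward(x, router, experts, k)
    noisy = moe_forward(add_gaussian_noise(x, sigma, rng), router, experts, k)
    return [gate * c + (1.0 - gate) * n for c, n in zip(clean, noisy)]


# Toy setup: 4 experts that simply scale their input, 2-dimensional tokens.
router = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
experts = [lambda v, s=s: [s * a for a in v] for s in (1.0, 2.0, 3.0, 4.0)]
rng = random.Random(0)
y = s2moe_like_forward([0.5, 2.0], router, experts, k=2, sigma=0.1, gate=0.5, rng=rng)
```

In this sketch, only `k` of the four experts run for each branch, which is where the sparsity comes from. With `sigma=0.0` the two branches coincide and the function reduces to plain Top-K routing.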
