S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning

Giang Do, Hung Le, Truyen Tran

TL;DR

Sparse Mixture of Experts (SMoE) models enable efficient training but suffer from representation collapse, in part because expert embeddings are much smaller than the model's dimension. S2MoE introduces a Gaussian Noise Module and a gating mechanism to learn from both deterministic and noise-augmented inputs, and is trained with a task loss plus a balancing loss and an uncertainty loss (InfoNCE) to diversify expert representations. The approach achieves performance comparable to state-of-the-art routing methods while reducing inference costs by 28% by activating fewer experts, and it shows strong pre-training and fine-tuning performance across multiple NLP tasks. This work advances practical deployment of SMoEs in large language models by enhancing feature learning and mitigating collapse without increasing inference complexity.

Abstract

Sparse Mixture of Experts (SMoE) enables efficient training of large language models by routing input tokens to a select number of experts. However, training SMoE remains challenging due to the issue of representation collapse. Recent studies have focused on improving the router to mitigate this problem, but existing approaches face two key limitations: (1) expert embeddings are significantly smaller than the model's dimension, contributing to representation collapse, and (2) routing each input to the Top-K experts can cause them to learn overly similar features. In this work, we propose a novel approach called Robust Sparse Mixture of Experts via Stochastic Learning (S2MoE), which is a mixture of experts designed to learn from both deterministic and non-deterministic inputs via Learning under Uncertainty. Extensive experiments across various tasks demonstrate that S2MoE achieves performance comparable to other routing methods while reducing computational inference costs by 28%.
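For readers unfamiliar with the Top-K routing that the abstract refers to, the following is a minimal sketch in plain Python. It is not the authors' implementation: the names `top_k_route` and `smoe_forward`, the toy scaling experts, and the choice to renormalize the router scores with a softmax over only the selected experts are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and return (index, weight) pairs.

    Assumption: weights are a softmax over the selected logits only.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def smoe_forward(token, router_logits, experts, k):
    """Weighted sum of the outputs of the k selected experts for one token."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Hypothetical toy experts: expert i simply scales its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
router_logits = [0.1, 2.0, -1.0, 2.0]

routes = top_k_route(router_logits, k=2)
output = smoe_forward(token, router_logits, experts, k=2)
```

Only the selected experts are evaluated, which is where the compute savings of sparse routing come from. Lowering `k` at inference time, as Figure 1 explores, reduces the number of expert evaluations per token.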

Paper Structure

This paper contains 17 sections, 9 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: BPC (bits per character) on the Text8 dataset with varying numbers of experts used for inference. S2MoE needs to activate only one expert to achieve performance comparable to other routing methods, saving 28% of the computational inference cost. All methods have the same FLOPs.
  • Figure 2: An illustration of our S2MoE, which enhances model knowledge through Gaussian noise generation. The method has two components: it learns from the original input and the noise-augmented input concurrently through SMoE, and it combines their outputs with a gating network implemented as a 1-layer MLP. Best viewed in color.
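
The two-branch design described in the Figure 2 caption can be sketched roughly as follows. This is a hedged illustration, not the paper's method: it assumes additive zero-mean Gaussian noise with a fixed standard deviation, a single shared `smoe` callable for both branches, and a gate that is one linear layer followed by a sigmoid producing a scalar mixing weight. The names `gaussian_noise`, `gate`, and `s2moe_forward` are hypothetical.

```python
import math
import random


def gaussian_noise(x, std, rng):
    """Return a copy of x with zero-mean Gaussian noise added (assumed form)."""
    return [v + rng.gauss(0.0, std) for v in x]


def gate(x, weights, bias):
    """1-layer gate: linear map plus sigmoid, giving a mixing weight in (0, 1)."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def s2moe_forward(x, smoe, gate_w, gate_b, noise_std, rng):
    """Mix the SMoE output on the clean input with that on the noisy input."""
    clean_out = smoe(x)
    noisy_out = smoe(gaussian_noise(x, noise_std, rng))
    g = gate(x, gate_w, gate_b)
    return [g * c + (1.0 - g) * n for c, n in zip(clean_out, noisy_out)]


# Hypothetical stand-in for an SMoE layer: it just doubles its input.
toy_smoe = lambda x: [2.0 * v for v in x]

rng = random.Random(0)
x = [1.0, -1.0]
mixed = s2moe_forward(x, toy_smoe, gate_w=[0.5, 0.5], gate_b=0.0,
                      noise_std=0.1, rng=rng)
# With zero noise both branches agree, so the output equals the SMoE output.
no_noise = s2moe_forward(x, toy_smoe, gate_w=[0.5, 0.5], gate_b=0.0,
                         noise_std=0.0, rng=rng)
```

In this sketch the noisy branch acts as a perturbed view of the same token, and the gate decides how much each view contributes to the output.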