Mixture of Experts with Soft Nearest Neighbor Loss: Resolving Expert Collapse via Representation Disentanglement

Abien Fred Agarap, Arnulfo P. Azcarraga

Abstract

The Mixture-of-Experts (MoE) model uses a set of expert networks that specialize on subsets of a dataset under the supervision of a gating network. A common issue in MoE architectures is ``expert collapse,'' where overlapping class boundaries in the raw input feature space cause multiple experts to learn redundant representations, forcing the gating network into rigid routing to compensate. We propose an enhanced MoE architecture that incorporates a feature extractor network optimized with the Soft Nearest Neighbor Loss (SNNL) before feeding input features to the gating and expert networks. By pre-conditioning the latent space to minimize distances among class-similar data points, we resolve structural expert collapse, which results in experts learning highly orthogonal weights. We employ Expert Specialization Entropy and Pairwise Embedding Similarity to quantify this dynamic. We evaluate our approach on four benchmark image classification datasets (MNIST, FashionMNIST, CIFAR10, and CIFAR100) and show that our SNNL-augmented MoE models exhibit structurally diverse experts, allowing the gating network to adopt a more flexible routing strategy. This paradigm significantly improves classification accuracy on the FashionMNIST, CIFAR10, and CIFAR100 datasets.
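To make the pre-conditioning step concrete, the following is a minimal PyTorch sketch of the soft nearest neighbor loss computed over a batch of hidden-layer features, following the standard formulation of Frosst et al. (2019). The function name, the default temperature, and the numerical-stability constant are illustrative assumptions and not taken from the paper.

```python
import torch

def soft_nearest_neighbor_loss(features: torch.Tensor,
                               labels: torch.Tensor,
                               temperature: float = 100.0) -> torch.Tensor:
    """Soft nearest neighbor loss over a batch of feature vectors.
    Lower values indicate that same-class points are relatively closer
    to each other than to points from other classes."""
    # Pairwise squared Euclidean distances between all feature vectors.
    distances = torch.cdist(features, features, p=2).pow(2)
    # Similarity kernel; zero out self-similarity on the diagonal.
    similarities = torch.exp(-distances / temperature)
    eye = torch.eye(features.size(0), device=features.device, dtype=torch.bool)
    similarities = similarities.masked_fill(eye, 0.0)
    # Numerator: similarities restricted to pairs sharing the same label.
    same_label = labels.unsqueeze(0) == labels.unsqueeze(1)
    numerator = (similarities * same_label).sum(dim=1)
    denominator = similarities.sum(dim=1)
    eps = 1e-8  # guards against classes with a single sample in the batch
    return -torch.log(numerator / (denominator + eps) + eps).mean()
```

In the proposed architecture, a term of this form would be added to the feature extractor's training objective so that the representations passed to the gating and expert networks already cluster by class.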

Figures (7)

  • Figure 1: Architecture of the CNN-based Feature Extractor. This module computes disentangled representations that are subsequently routed to the MoE gating network and experts.
  • Figure 2: The mixture-of-experts model is a system of experts and gating networks where each expert becomes a function of a subset of the input environment. These expert networks receive the same inputs and produce the same number of outputs. The gating network also receives the same input as the expert networks, but its output is the probability of choosing a particular expert for a given input.
  • Figure 3: We optimize the soft nearest neighbor loss over the hidden layers of the feature extractor network that precedes the MoE model. In doing so, the input features to the expert and gating networks are transformed into representations with class information embedded in them, thereby helping improve the overall classification performance of the MoE model.
  • Figure 4: Pairwise cosine similarity between expert weight matrices on CIFAR10 (a computation sketched after this list). (Left) The Baseline shows distinct patches of redundancy (e.g., between Expert 0 and Expert 3). (Right) The SNNL condition visibly suppresses these redundancies, forcing experts to learn highly distinct and orthogonal features.
  • Figure 5: Distribution of Test Accuracy and Routing Entropy across random seeds. (Top/CIFAR100) SNNL provides a robust, highly stable boost to both accuracy and entropy, elevating the entire interquartile range. (Bottom/CIFAR10) While median entropy slightly increases, the wider variance and overlapping accuracy boxes explain the non-significant accuracy gain on this intermediate dataset.
  • ...and 2 more figures
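The redundancy analysis in Figure 4 is based on pairwise cosine similarity between expert weight matrices. This section does not specify the exact flattening scheme, so the sketch below, which concatenates each expert's parameters into a single vector before normalizing, is an assumed but typical way to compute such a similarity matrix; `expert_weight_similarity` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def expert_weight_similarity(experts: list[torch.nn.Module]) -> torch.Tensor:
    """Pairwise cosine similarity between flattened expert weight vectors.
    Off-diagonal values near 1 indicate redundant (collapsed) experts;
    values near 0 indicate orthogonal, specialized experts."""
    # Flatten every expert's parameters into a single vector.
    vectors = torch.stack([
        torch.cat([p.detach().flatten() for p in expert.parameters()])
        for expert in experts
    ])
    vectors = F.normalize(vectors, dim=1)  # unit-norm each expert vector
    return vectors @ vectors.T             # (num_experts, num_experts)
```

Under this reading, the bright off-diagonal patches in the baseline heatmap (e.g., between Expert 0 and Expert 3) correspond to similarity values near 1, while the suppressed values in the SNNL condition correspond to near-orthogonal expert weights.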