Tight Clusters Make Specialized Experts

Stefan K. Nielsen, Rachel S. Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen

TL;DR

This work addresses a bottleneck in Sparse MoE routing: latent clusters that are hard to identify in high dimensions. It introduces the Adaptive Clustering (AC) router and ACMoE. By deriving cluster-specific feature weights $w_{qk}$ and a per-cluster scaling $M_k$, routing is performed in a cluster-adaptive transformed space that increases cluster separability and improves input-expert matching. Theoretical results show faster convergence and stronger robustness to data contamination, while empirical evaluations on WikiText-103, EnWik-8, and ImageNet demonstrate significant performance gains with negligible overhead. The approach enhances specialization of experts across semantically distinct input regions, with broad applicability across vision and language MoE backbones and robustness to adversarial and distributional shifts.

Abstract

Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. However, latent clusters may be unidentifiable in high dimension, which causes slow convergence, susceptibility to data contamination, and overall degraded representations as the router is unable to perform appropriate token-expert matching. We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. We use these weights to compute the token-expert routing assignments in an adaptively transformed space that promotes well-separated clusters, which helps identify the best-matched expert for each token. In particular, for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature. We term this novel router the Adaptive Clustering (AC) router. Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings.
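
To make the idea of routing in an adaptively transformed space concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's router: it swaps in a feature-weighted nearest-centroid rule, and the names `route`, `centroids`, and `weights`, as well as every numeric value, are hypothetical. The example assumes feature 0 carries the cluster identity while feature 1 is mostly noise.

```python
def route(token, centroids, weights):
    """Return the index of the cluster whose centroid is closest to `token`.

    Distances are squared Euclidean distances in which feature q is scaled
    by weights[k][q], so each cluster k measures closeness in its own
    (hypothetical) transformed space.
    """
    def weighted_sq_dist(k):
        return sum(
            w * (x - c) ** 2
            for w, x, c in zip(weights[k], token, centroids[k])
        )

    return min(range(len(centroids)), key=weighted_sq_dist)


# Hypothetical setup: both clusters are tight along feature 0 and loose
# along feature 1, so feature 0 is the informative one.
centroids = [[0.0, 0.0], [2.0, 3.0]]
token = [0.2, 2.8]  # close to cluster 0 on feature 0, close to cluster 1 on feature 1

# Uniform weights treat the noisy feature as equally important.
uniform = [[0.5, 0.5], [0.5, 0.5]]
# Cluster-specific weights emphasize the feature each cluster is tight on.
adaptive = [[0.9, 0.1], [0.8, 0.2]]

uniform_choice = route(token, centroids, uniform)
adaptive_choice = route(token, centroids, adaptive)
print(uniform_choice, adaptive_choice)
```

With uniform weights the noisy feature dominates the distance and the token lands in cluster 1. With the cluster-specific weights the tight feature dominates and the token lands in cluster 0. Comparing distances computed under different per-cluster weights is a simplification made for this sketch.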

Paper Structure

This paper contains 58 sections, 7 theorems, 40 equations, 5 figures, 20 tables.

Key Result

Theorem 1

Let $s_{qk} := N_k^{-2}\sum_{r(i) = k} \sum_{r(j) = k} \rho_{ijq}$ be a measure of dispersion on the $q^{\mathrm{th}}$ feature for the representations assigned to cluster $k$. Then, for a given router function $r : \mathbb R^d \to [E]$, the corresponding optimal weights $\{\boldsymbol{w}_k\}_{k \in [E]}$ [...] for $(q, k) \in [d]\times [E]$, where $\{\alpha_k\}_{k \in [E]}$ are constants that for any $\lambda$ ...
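
The dispersion measure $s_{qk}$ in Theorem 1 can be computed directly from its definition. The sketch below is a minimal illustration, not the authors' implementation. The excerpt does not define $\rho_{ijq}$, so the code assumes the absolute difference $|h_{iq} - h_{jq}|$ between two representations on feature $q$. The function name `dispersion` and the sample data are hypothetical.

```python
def dispersion(tokens, assignments, q, k, rho=lambda a, b: abs(a - b)):
    """Compute s_qk = N_k^{-2} * sum_{r(i)=k} sum_{r(j)=k} rho(h_iq, h_jq).

    tokens:      list of feature vectors (one list of floats per token)
    assignments: assignments[i] is the cluster r(i) of token i
    rho:         per-feature distance; absolute difference is an ASSUMPTION
    """
    # Feature-q values of the tokens assigned to cluster k.
    members = [h[q] for h, r in zip(tokens, assignments) if r == k]
    n_k = len(members)
    if n_k == 0:
        raise ValueError(f"cluster {k} has no assigned tokens")
    # Double sum over ordered pairs (i, j), including i == j.
    total = sum(rho(a, b) for a in members for b in members)
    return total / n_k ** 2


# Hypothetical data: cluster 0 is tight on feature 0 and spread out on feature 1.
tokens = [[0.0, 1.0], [1.0, 5.0], [2.0, -3.0], [9.0, 9.0]]
assignments = [0, 0, 0, 1]

s_00 = dispersion(tokens, assignments, q=0, k=0)  # tight feature
s_10 = dispersion(tokens, assignments, q=1, k=0)  # loose feature
print(s_00, s_10)
```

A smaller value means the cluster is tighter along that feature. Here feature 0 of cluster 0 has a dispersion of 8/9, while feature 1 has 32/9.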

Figures (5)

  • Figure 1: ACMoE discovers semantically distinct regions. We show 14x14 image reconstructions where patches are colored by assigned experts. Top row: Swin assigns large chunks of foreground and background to one expert (red), while ACMoE accurately discovers the bird and relevant foreground. Bottom row: When the background and foreground are hard to distinguish, Swin's router fails to register the stingray (left) or shark (right) and allocates one expert for virtually the entire image. ACMoE, however, discovers the semantically distinct regions, using one expert (green) to specialize on the foreground and different experts for the background.
  • Figure 2: Fast Convergence of ACMoE. Left: Convergence speed on WikiText-103 pretraining using the Generalist Language Model (GLaM; Du et al., 2022) backbone. Right: Convergence speed on Banking-77 finetuning using the Switch Transformer (Fedus et al., 2022) backbone. Across both backbones and tasks, we observe substantially faster convergence. We display final test perplexity (PPL) and accuracy (Acc.), showing better overall performance as well.
  • Figure 3: ACMoE and Swin Transformer under PGD attack at increasing perturbation budgets. ACMoE widens its performance gain over Swin at increasingly severe attacks in both top-1 test accuracy (left) and top-5 test accuracy (right), starting at approximately 7% improvement at 1/255 and ending at just over 10% at 5/255.
  • Figure 4: Cluster Visualization on ImageNet. Each token is represented as a point and colored by its assigned expert. Left: Swin identifies one cluster clearly (yellow/gold) but otherwise fails to distinguish the remaining clusters. Right: ACMoE learns better-defined expert clusters.
  • Figure 5: Router Instability of ACMoE, SMoE, XMoE, and StableMoE. ACMoE maintains consistent routing, while baseline routers more frequently change the expert assignments of tokens.
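
One simple way to quantify the routing changes described in the Figure 5 caption is the fraction of tokens whose expert assignment differs between two training snapshots. The sketch below assumes that definition; this listing does not state the paper's exact instability metric, and the name `assignment_change_rate` and the sample assignments are hypothetical.

```python
def assignment_change_rate(prev, curr):
    """Fraction of tokens whose assigned expert differs between two snapshots.

    prev, curr: expert indices for the same tokens, in the same order,
    recorded at two different points in training (ASSUMED setup).
    """
    if len(prev) != len(curr):
        raise ValueError("snapshots must cover the same tokens")
    if not prev:
        return 0.0
    changed = sum(p != c for p, c in zip(prev, curr))
    return changed / len(prev)


# Hypothetical expert assignments for four tokens at two checkpoints.
before = [0, 1, 2, 1]
after = [0, 2, 2, 1]
rate = assignment_change_rate(before, after)
print(rate)
```

Under this definition a lower rate corresponds to more consistent routing. In the example, one of four tokens switches experts, giving a rate of 0.25.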

Theorems & Definitions (13)

  • Theorem 1: Optimal feature weights
  • Definition 1: Adaptive Clustering Router Transformation $\boldsymbol{M}_k$
  • Definition 2: Adaptive Clustering Router and MoE Layer
  • Remark 1
  • Remark 2
  • Lemma 1: Adaptive Clustering Router Transformation Increases Cluster Separation
  • Lemma 2: Incorrect Assignment Probability
  • Remark 3
  • Proposition 1: Robustness of ACMoE
  • Proposition 2: Faster convergence of ACMoE
  • ...and 3 more