From Sparse to Soft Mixtures of Experts

Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Neil Houlsby

TL;DR

The paper tackles scaling limitations in sparse mixtures of experts by introducing Soft MoE, a fully differentiable routing mechanism that replaces hard token-to-expert assignments with soft, token-wise mixtures. This design preserves the benefits of MoEs—large capacity and selective expert computation—while eliminating token dropping and load-balancing pathologies, and it maintains tractable inference costs. Empirical results on vision tasks (e.g., JFT-4B pretraining and ImageNet variants) show Soft MoE substantially outperforming dense ViTs and traditional MoEs across upstream, few-shot, and fine-tuning metrics, with favorable training efficiency and robust scaling to hundreds of experts. The work also demonstrates versatility through contrastive learning on image-text data and analyzes design choices (slots, experts, routing patterns) to guide practical deployment. Overall, Soft MoE offers a scalable, differentiable alternative to discrete MoE routing that improves performance at comparable compute budgets, with broad applicability to vision and multimodal settings.

Abstract

Sparse mixture of expert architectures (MoEs) scale model capacity without significant increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning. In this work, we propose Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges, while maintaining the benefits of MoEs. Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert. As in other MoEs, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity (and performance) at lower inference cost. In the context of visual recognition, Soft MoE greatly outperforms dense Transformers (ViTs) and popular MoEs (Tokens Choice and Experts Choice). Furthermore, Soft MoE scales well: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT Huge/14, with only 2% increased inference time, and substantially better quality.
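
The soft assignment described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: each slot receives a softmax-weighted average of all input tokens, each expert processes only its own slots, and each output token is a weighted combination of the slot outputs. The names `soft_moe_layer`, `make_expert`, and `phi`, the ReLU expert MLP, and the random initialization are assumptions made for this example; the combine step (a softmax over all slots for each token) is not spelled out in the text above and is included here as an assumption about how slot outputs are mapped back to tokens.

```python
import numpy as np


def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def make_expert(rng, d, hidden):
    # Hypothetical expert: a two-layer ReLU MLP with random weights.
    w1 = rng.normal(scale=d ** -0.5, size=(d, hidden))
    w2 = rng.normal(scale=hidden ** -0.5, size=(hidden, d))
    return lambda s: np.maximum(s @ w1, 0.0) @ w2


def soft_moe_layer(x, phi, experts):
    """Sketch of a Soft MoE layer.

    x:       (m, d) input tokens
    phi:     (d, n, p) per-slot parameters (n experts, p slots per expert)
    experts: list of n callables mapping (p, d) -> (p, d)
    """
    m, _ = x.shape
    n, p = phi.shape[1], phi.shape[2]

    # One logit per (token, slot) pair.
    logits = np.einsum("md,dnp->mnp", x, phi)

    # Dispatch weights: for each slot, a distribution over all tokens.
    dispatch = softmax(logits, axis=0)
    # Combine weights (assumption): for each token, a distribution over all slots.
    combine = softmax(logits.reshape(m, n * p), axis=1).reshape(m, n, p)

    # Each slot is a weighted average of every input token.
    slots = np.einsum("mnp,md->npd", dispatch, x)
    # Each expert only processes its own p slots.
    outs = np.stack([experts[i](slots[i]) for i in range(n)])
    # Each output token is a weighted average of all slot outputs.
    y = np.einsum("mnp,npd->md", combine, outs)
    return y, dispatch, combine


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, d, n, p = 6, 4, 3, 2
    x = rng.normal(size=(m, d))
    phi = rng.normal(size=(d, n, p))
    experts = [make_expert(rng, d, hidden=8) for _ in range(n)]
    y, dispatch, combine = soft_moe_layer(x, phi, experts)
    print(y.shape, dispatch.sum(axis=0))
```

Because every operation here is a softmax or a matrix product, the whole layer is differentiable, and no token can be dropped: every token contributes to every slot with some nonzero weight.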

Paper Structure

This paper contains 41 sections, 9 equations, 29 figures, 8 tables.

Figures (29)

  • Figure 1: Sparse and Soft MoE layers. While the router in Sparse MoE layers (left) learns to assign individual input tokens to each of the available slots, in Soft MoE layers (right) each slot is the result of a (different) weighted average of all the input tokens. Learning to make discrete assignments introduces several optimization and implementation issues that Soft MoE sidesteps. The appendix on model inspection (app:model_inspection) visualizes the distributions of soft assignments learned by Soft MoE.
  • Figure 2: Train Pareto frontiers. Soft MoE dominates both ViTs (dense) and popular MoEs (Experts and Tokens Choice) on the training cost / performance Pareto frontier. Larger marker sizes indicate larger models, ranging from S/32 to H/14. Cost is reported in terms of FLOPs and TPU-v3 training time. Only models on their respective Pareto frontiers are displayed; the appendix with additional results (app:additional_results) shows all trained models.
  • Figure 3: Models with long training durations. Models trained for 4M steps (H/14 trained only for 2M steps). Equivalent model classes (S/16, B/16, etc.) have similar training costs, but Soft MoE outperforms ViT on all metrics at a fixed training budget.
  • Figure 4: Models optimized for inference speed. Performance of models trained for more steps, thereby optimized for performance at a given inference cost (TPU-v3 time or FLOPs).
  • Figure 5: Top: Performance (ImageNet) for MoEs with different numbers of experts (columns) and slots-per-token / assignments-per-token (rows). Bottom: Training throughput of the same models. Across the columns, the number of parameters increases, but the theoretical cost (FLOPs) of the model (not including routing cost) remains constant. Descending the rows, the expert layers become more compute-intensive as more tokens/slots are processed in the MoE layers.
  • ...and 24 more figures
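
The Figure 5 caption notes that adding experts increases parameters while the theoretical expert cost stays constant, and the abstract reports over 40x more parameters for Soft MoE Huge/14 than for ViT Huge/14. The back-of-the-envelope sketch below illustrates both points. It is an approximation under stated assumptions, not a reproduction of the paper's accounting: the ViT Huge/14 width (1280), MLP hidden size (5120), and dense parameter count (about 632M) are taken from the standard ViT configuration rather than from this page, routing parameters and routing cost are ignored, and the helper names are hypothetical.

```python
def mlp_params(width, hidden):
    # Two dense layers with biases: width -> hidden -> width.
    return width * hidden + hidden + hidden * width + width


def moe_extra_params(width, hidden, num_experts, num_moe_layers):
    # Each MoE layer replaces one dense MLP with num_experts expert MLPs,
    # so it adds (num_experts - 1) MLPs' worth of parameters.
    return num_moe_layers * (num_experts - 1) * mlp_params(width, hidden)


def expert_flops(num_experts, slots_per_expert, width, hidden):
    # Two matmuls per slot, counting 2 FLOPs per multiply-add.
    total_slots = num_experts * slots_per_expert
    return total_slots * 2 * (width * hidden + hidden * width)


# Assumed ViT Huge/14 configuration (not stated on this page).
WIDTH, HIDDEN, DENSE_PARAMS = 1280, 5120, 632_000_000

if __name__ == "__main__":
    extra = moe_extra_params(WIDTH, HIDDEN, num_experts=128, num_moe_layers=16)
    ratio = (DENSE_PARAMS + extra) / DENSE_PARAMS
    print(f"approximate parameter ratio: {ratio:.1f}x")

    # Hold the total number of slots fixed at 256 while varying the expert count.
    for n in (32, 64, 128, 256):
        p = 256 // n
        print(n, mlp_params(WIDTH, HIDDEN) * n, expert_flops(n, p, WIDTH, HIDDEN))
```

With the total number of slots held fixed, the expert FLOPs do not depend on how many experts those slots are spread across, whereas the parameter count grows linearly with the number of experts.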