From Sparse to Soft Mixtures of Experts
Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Neil Houlsby
TL;DR
The paper addresses known limitations of sparse mixtures of experts (MoEs) by introducing Soft MoE, a fully differentiable layer that replaces hard token-to-expert assignments with soft ones: each expert receives weighted combinations of all input tokens rather than a discrete subset of them. This design preserves the main benefits of MoEs (large capacity, with each expert processing only a subset of the combined tokens) while avoiding token dropping and load-balancing problems and keeping inference cost low. On vision tasks (e.g., JFT-4B pretraining and ImageNet variants), Soft MoE substantially outperforms dense ViTs and traditional MoEs on upstream, few-shot, and fine-tuning metrics, with favorable training efficiency and robust scaling to hundreds of experts. The paper also applies Soft MoE to contrastive learning on image-text data and analyzes design choices (number of slots, number of experts, routing patterns) to guide practical deployment. Overall, Soft MoE is a scalable, differentiable alternative to discrete MoE routing that improves performance at comparable compute budgets and is applicable to both vision and multimodal settings.
Abstract
Sparse mixture of expert architectures (MoEs) scale model capacity without significant increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning. In this work, we propose Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges, while maintaining the benefits of MoEs. Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert. As in other MoEs, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity (and performance) at lower inference cost. In the context of visual recognition, Soft MoE greatly outperforms dense Transformers (ViTs) and popular MoEs (Tokens Choice and Experts Choice). Furthermore, Soft MoE scales well: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT Huge/14, with only 2% increased inference time, and substantially better quality.
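The "implicit soft assignment" described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the dispatch/combine formulation from the body of the paper: one set of token-slot logits is normalized over tokens to build each expert's input slots, and normalized over slots to mix the expert outputs back into tokens. The names `soft_moe_layer`, `phi`, and `expert_fns`, the toy `tanh` experts, and all sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def soft_moe_layer(x, phi, expert_fns):
    """Sketch of a Soft MoE layer (hypothetical helper, not the authors' code).

    x:          (m, d) array of m input tokens.
    phi:        (d, n, p) array of learnable slot parameters
                (n experts, p slots per expert).
    expert_fns: list of n callables mapping (p, d) -> (p, d).
    """
    m, _ = x.shape
    _, n, p = phi.shape

    # One logit per (token, slot) pair.
    logits = np.einsum("md,dnp->mnp", x, phi)

    # Dispatch weights: for each slot, a softmax over all tokens.
    dispatch = softmax(logits, axis=0)
    # Combine weights: for each token, a softmax over all n * p slots.
    combine = softmax(logits.reshape(m, n * p), axis=1).reshape(m, n, p)

    # Each slot is a weighted average of ALL input tokens.
    slots = np.einsum("mnp,md->npd", dispatch, x)

    # Each expert processes only its own p slots, not every token.
    outs = np.stack([f(slots[i]) for i, f in enumerate(expert_fns)])

    # Each output token is a weighted average of all slot outputs.
    y = np.einsum("mnp,npd->md", combine, outs)
    return y, dispatch, combine


# Toy usage: 6 tokens of width 4, 3 experts with 2 slots each.
rng = np.random.default_rng(0)
m, d, n, p = 6, 4, 3, 2
x = rng.normal(size=(m, d))
phi = rng.normal(size=(d, n, p))
weights = rng.normal(size=(n, d, d))
# Stand-in experts: one tanh-activated linear map per expert.
experts = [lambda s, w=w: np.tanh(s @ w) for w in weights]

y, dispatch, combine = soft_moe_layer(x, phi, experts)
print(y.shape)
```

Every operation here is a softmax or a matrix product, so nothing is discretely routed, no token can be dropped, and every expert always receives exactly `p` slots. Expert compute depends on the total number of slots (`n * p`) rather than on the number of experts alone, which is consistent with the abstract's claim that parameter count can grow much faster than inference time.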
