Mixture of Experts in Image Classification: What's the Sweet Spot?
Mathurin Videau, Alessandro Leite, Marc Schoenauer, Olivier Teytaud
TL;DR
This paper systematically studies integrating Mixture-of-Experts (MoE) into image classification architectures, focusing on ConvNeXt and Vision Transformer backbones trained on ImageNet-1k and ImageNet-21k. It finds that moderate per-sample activation (several active experts) provides the best balance of accuracy and efficiency, while activating too many parameters yields diminishing returns, especially in large models. The results show MoE most benefits tiny to mid-sized models, with a robust Last-2 placement strategy and a simple linear router delivering the best overall performance; larger datasets permit more experts (up to 16 for ConvNeXt) without changing placement. However, MoE does not redefine state-of-the-art ImageNet performance in large models and sometimes degrades robustness on out-of-distribution data, highlighting a scale- and context-dependent usefulness. Overall, the work yields practical design principles for vision MoE systems and emphasizes the importance of architecture choice and data scale in realizing MoE benefits.
Abstract
Mixture-of-Experts (MoE) models have shown promising potential for parameter-efficient scaling across domains. However, their application to image classification remains limited, often requiring billion-scale datasets to be competitive. In this work, we explore the integration of MoE layers into image classification architectures using open datasets. We conduct a systematic analysis across different MoE configurations and model scales. We find that moderate parameter activation per sample provides the best trade-off between performance and efficiency. However, as the number of activated parameters increases, the benefits of MoE diminish. Our analysis yields several practical insights for vision MoE design. First, MoE layers most effectively strengthen tiny and mid-sized models, while gains taper off for large-capacity networks and do not redefine state-of-the-art ImageNet performance. Second, a Last-2 placement heuristic offers the most robust cross-architecture choice, with Every-2 slightly better for Vision Transformers (ViT), and both remain effective as data and model scale increase. Third, larger datasets (e.g., ImageNet-21k) allow more experts (up to 16 for ConvNeXt) to be used effectively without changing placement, as increased data reduces overfitting and promotes broader expert specialization. Finally, a simple linear router performs best, suggesting that additional routing complexity yields no consistent benefit.
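To make the placement and routing terms above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that Last-2 means the final two blocks of the backbone host an MoE layer and that Every-2 means every second block does; the paper's exact definitions may differ. It also assumes the linear router is a single linear map followed by a softmax and top-k selection. The names `placement`, `linear_router`, and `moe_forward` are hypothetical and chosen for illustration only.

```python
import math


def placement(num_blocks, strategy):
    """Return the block indices that would host an MoE layer.

    Assumed semantics (not taken from the paper):
    "last-2" = the final two blocks, "every-2" = every second block.
    """
    if strategy == "last-2":
        return list(range(max(0, num_blocks - 2), num_blocks))
    if strategy == "every-2":
        return list(range(1, num_blocks, 2))
    raise ValueError(f"unknown strategy: {strategy}")


def linear_router(x, weights, top_k):
    """Score experts with one linear map, keep the top_k, renormalize gates.

    x:       input feature vector (list of floats)
    weights: one row of router weights per expert
    Returns a dict mapping expert index -> gate value (gates sum to 1).
    """
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, weights, experts, top_k):
    """Combine the outputs of the selected experts, weighted by their gates."""
    gates = linear_router(x, weights, top_k)
    out = [0.0] * len(x)
    for i, gate in gates.items():
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + gate * y_j for o, y_j in zip(out, y)]
    return out
```

In this sketch, `top_k` controls how many parameters are activated per sample, which is the quantity the abstract identifies as having a moderate sweet spot. For example, `placement(12, "last-2")` selects blocks 10 and 11 of a hypothetical 12-block backbone, and `moe_forward(x, weights, experts, top_k=2)` evaluates only two of the available experts for each input.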
