Mixture of Experts in Image Classification: What's the Sweet Spot?

Mathurin Videau, Alessandro Leite, Marc Schoenauer, Olivier Teytaud

TL;DR

This paper systematically studies integrating Mixture-of-Experts (MoE) into image classification architectures, focusing on ConvNeXt and Vision Transformer backbones trained on ImageNet-1k and ImageNet-21k. It finds that moderate per-sample activation (several active experts) provides the best balance of accuracy and efficiency, while activating too many parameters yields diminishing returns, especially in large models. The results show MoE most benefits tiny to mid-sized models, with a robust Last-2 placement strategy and a simple linear router delivering the best overall performance; larger datasets permit more experts (up to 16 for ConvNeXt) without changing placement. However, MoE does not redefine state-of-the-art ImageNet performance in large models and sometimes degrades robustness on out-of-distribution data, highlighting a scale- and context-dependent usefulness. Overall, the work yields practical design principles for vision MoE systems and emphasizes the importance of architecture choice and data scale in realizing MoE benefits.

Abstract

Mixture-of-Experts (MoE) models have shown promising potential for parameter-efficient scaling across domains. However, their application to image classification remains limited, often requiring billion-scale datasets to be competitive. In this work, we explore the integration of MoE layers into image classification architectures using open datasets. We conduct a systematic analysis across different MoE configurations and model scales. We find that moderate parameter activation per sample provides the best trade-off between performance and efficiency. However, as the number of activated parameters increases, the benefits of MoE diminish. Our analysis yields several practical insights for vision MoE design. First, MoE layers most effectively strengthen tiny and mid-sized models, while gains taper off for large-capacity networks and do not redefine state-of-the-art ImageNet performance. Second, a Last-2 placement heuristic offers the most robust cross-architecture choice, with Every-2 slightly better for Vision Transformer (ViT), and both remaining effective as data and model scale increase. Third, larger datasets (e.g., ImageNet-21k) allow ConvNeXt to use more experts effectively (up to 16) without changing placement, as increased data reduces overfitting and promotes broader expert specialization. Finally, a simple linear router performs best, suggesting that additional routing complexity yields no consistent benefit.
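
To make the "simple linear router" concrete, here is a minimal sketch of top-k routing in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the softmax gating, the renormalization of gates over the selected experts, and the names `linear_router`, `moe_forward`, `weights`, and `experts` are all hypothetical choices made for this example.

```python
import math


def linear_router(x, weights, k):
    """Score every expert with a single linear map, then keep the top-k.

    x:       feature vector of one token/patch (list of floats)
    weights: one row of router weights per expert (hypothetical matrix)
    k:       number of experts activated per sample
    Returns (expert_index, gate) pairs; gates are renormalized to sum to 1
    (an assumption, other gating conventions exist).
    """
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    # Numerically stable softmax over all experts.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k highest-probability experts, best first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, weights, experts, k):
    """Run only the selected experts and mix their outputs by gate value."""
    routed = linear_router(x, weights, k)
    outputs = [experts[i](x) for i, _ in routed]
    return [
        sum(gate * out[j] for (_, gate), out in zip(routed, outputs))
        for j in range(len(x))
    ]


# Toy setup: 4 experts, each just scales its input by a different factor.
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
weights = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
x = [1.0, 0.0]

routed = linear_router(x, weights, k=2)
y = moe_forward(x, weights, experts, k=2)
```

In this toy setup only two of the four experts run for the input, which is the sense in which per-sample activated parameters stay below the total parameter count.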

Paper Structure

This paper contains 26 sections, 1 equation, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Vision Transformer and ConvNeXt architectures.
  • Figure 2: Left: Pareto front for ImageNet-21k, x-axis = number of activations per sample. ViT models have been presented in MoE versions only after additional pretraining, and are therefore not shown. MoE seems to be Pareto optimal when the number of activations per sample is below 90M. Right: Pareto front for ViT models on JFT-300M. Overall, MoE is never validated for a number of activations per sample above $\approx$ 100M.
  • Figure 3: Exploring the interplay between the size and count of experts in a MoE layer for ConvNeXt-T on ImageNet-1K. Baseline results (without MoE) are denoted by dotted lines. For this small dataset, MoE does not bring much improvement (see Figure 2 for bigger datasets).
  • Figure 4: Improvement from MoE (y-axis) vs. baseline accuracy without MoE (x-axis) for image classification, across pretraining sizes. Larger pretraining yields greater gains; higher baseline accuracy reduces impact. Dashed lines show maximal improvement per accuracy level. Results are drawn from Tables \ref{tab:21k:reduced}, \ref{tab:1k:reduced}, and \ref{tab:21k:moe} (in Appendix \ref{app:moedense}).
  • Figure 5: Distribution of experts
  • ...and 6 more figures