Routers in Vision Mixture of Experts: An Empirical Study

Tianlin Liu; Mathieu Blondel; Carlos Riquelme; Joan Puigcerver

TL;DR

This work presents a unified formulation of MoE layers for vision tasks, built on two routing tensors that define how input tokens are dispatched to experts and how expert outputs are combined. It instantiates six routers (Softmax Token Choice, Sinkhorn Token Choice, Softmax Expert Choice, Sinkhorn Expert Choice, Sparsity-constrained Expert Choice, and Soft MoE) and compares them head-to-head in ViT-based MoEs with a fixed 32 experts. Key findings show that Expert Choice routers generally outperform Token Choice routers in sparse MoEs, while Soft MoEs achieve the best performance-cost efficiency, particularly with larger expert counts. The results underscore the central role of routing design in vision MoEs and suggest Soft MoEs as a robust default for scalable vision modeling, with Sinkhorn-based routing offering a practical balance between throughput and accuracy. Overall, the paper extends MoE insights from language to vision, offering a cohesive framework and empirical guidance for router choices in large-scale vision models.

Abstract

Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost. A key component of MoEs is the router, which decides which subset of parameters (experts) process which feature embeddings (tokens). In this paper, we present a comprehensive study of routers in MoEs for computer vision tasks. We introduce a unified MoE formulation that subsumes different MoEs with two parametric routing tensors. This formulation covers both sparse MoE, which uses a binary or hard assignment between experts and tokens, and soft MoE, which uses a soft assignment between experts and weighted combinations of tokens. Routers for sparse MoEs can be further grouped into two variants: Token Choice, which matches experts to each token, and Expert Choice, which matches tokens to each expert. We conduct head-to-head experiments with 6 different routers, including existing routers from prior work and new ones we introduce. We show that (i) many routers originally developed for language modeling can be adapted to perform strongly in vision tasks, (ii) in sparse MoE, Expert Choice routers generally outperform Token Choice routers, and (iii) soft MoEs generally outperform sparse MoEs with a fixed compute budget. These results provide new insights regarding the crucial role of routers in vision MoE models.
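The distinction between Token Choice, Expert Choice, and soft assignment can be illustrated with a small, self-contained sketch. It is not the paper's implementation: the function names, the toy score matrix, and the simplification of one slot per expert in the soft case are all assumptions made for illustration, and the token-expert scores are given directly rather than produced by a learned router.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def token_choice(scores, k):
    """Each token (row) selects its k highest-scoring experts.
    Returns a binary tokens x experts dispatch matrix."""
    n_experts = len(scores[0])
    dispatch = []
    for row in scores:
        top = sorted(range(n_experts), key=lambda e: row[e], reverse=True)[:k]
        dispatch.append([1 if e in top else 0 for e in range(n_experts)])
    return dispatch

def expert_choice(scores, capacity):
    """Each expert (column) selects its `capacity` highest-scoring tokens.
    Returns a binary tokens x experts dispatch matrix."""
    n_tokens, n_experts = len(scores), len(scores[0])
    dispatch = [[0] * n_experts for _ in range(n_tokens)]
    for e in range(n_experts):
        top = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)[:capacity]
        for t in top:
            dispatch[t][e] = 1
    return dispatch

def soft_weights(scores):
    """Soft assignment (assumed: one slot per expert).
    Dispatch weights are normalised over tokens for each expert;
    combine weights are normalised over experts for each token."""
    n_tokens, n_experts = len(scores), len(scores[0])
    cols = [softmax([scores[t][e] for t in range(n_tokens)]) for e in range(n_experts)]
    dispatch = [[cols[e][t] for e in range(n_experts)] for t in range(n_tokens)]
    combine = [softmax(row) for row in scores]
    return dispatch, combine

# Hypothetical scores for 4 tokens (rows) and 2 experts (columns).
scores = [
    [2.0, 0.5],
    [1.5, 0.1],
    [1.0, 0.3],
    [0.2, 0.9],
]

tc = token_choice(scores, k=1)          # every token is processed; load may be uneven
ec = expert_choice(scores, capacity=2)  # load is even; some tokens may be skipped
soft_dispatch, soft_combine = soft_weights(scores)

print("token choice :", tc)
print("expert choice:", ec)
```

On this toy input, Token Choice sends three of the four tokens to the first expert, whereas Expert Choice gives each expert exactly two tokens at the cost of leaving the third token unprocessed and routing the first token to both experts. The soft variant drops nothing, since every expert receives a weighted combination of all tokens.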

Paper Structure (31 sections, 16 equations, 3 figures, 6 tables, 2 algorithms)

Notations.
A Unified Formulation of MoE Layers
A motivating example of MoE layer
Unifying MoE layers through routers
Recovering the Softmax Token Choice layer as a special case.
MoE routers.
MoE layers instantiated by different routers
Softmax Token Choice router
Balancing expert usage.
Sinkhorn Token Choice router
Softmax Expert Choice router
Sinkhorn Expert Choice router
Sparsity-constrained Expert Choice router
Soft MoE
Experiments
...and 16 more sections

Figures (3)

Figure 1: Comparison of training time and performance in the JFT300M dataset for image classification. The marker size represents the router's capacity, with smaller and larger sizes indicating lower and higher capacities.
Figure 2: Comparison of training time and performance in a 10-shot transfer task on the ImageNet-1k dataset. The marker size represents the router's capacity, with smaller and larger sizes indicating lower and higher capacities.
Figure 3: Assessing the impact of using Softmax-based combine tensors in Sinkhorn Token Choice routers. The number of selected experts is $k=1$ (left panel) and $k=2$ (right panel). Both routers are used in a B32 architecture. The $x$ axis shows the training iteration number, while the $y$ axis shows the validation accuracy on the JFT300M dataset.
