ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts

Xumeng Han, Longhui Wei, Zhiyang Dou, Zipeng Wang, Chenhui Qiang, Xin He, Yingfei Sun, Zhenjun Han, Qi Tian

TL;DR

This paper introduces a shared expert that learns and captures common knowledge, which proves an effective way to construct a stable ViMoE. It also demonstrates how to analyze expert routing behavior, revealing which MoE layers can specialize in handling specific information and which cannot.

Abstract

Mixture-of-Experts (MoE) models embody the divide-and-conquer concept and are a promising approach for increasing model capacity, demonstrating excellent scalability across multiple domains. In this paper, we integrate the MoE structure into the classic Vision Transformer (ViT), naming it ViMoE, and explore the potential of applying MoE to vision through a comprehensive study on image classification and semantic segmentation. However, we observe that performance is sensitive to the configuration of MoE layers, making it challenging to obtain optimal results without careful design. The underlying cause is that inappropriate MoE layers lead to unreliable routing and hinder experts from effectively acquiring helpful information. To address this, we introduce a shared expert to learn and capture common knowledge, which serves as an effective way to construct a stable ViMoE. Furthermore, we demonstrate how to analyze expert routing behavior, revealing which MoE layers are capable of specializing in handling specific information and which are not. This provides guidance for retaining the critical layers while removing redundant ones, thereby making ViMoE more efficient without sacrificing accuracy. We hope this work offers new insights into the design of vision MoE models and provides valuable empirical guidance for future research.
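
To make the shared-expert idea concrete, the following is a minimal NumPy sketch of an MoE layer in which every token passes through one shared expert and is additionally routed to one of several routed experts. It is an illustration under stated assumptions, not the paper's implementation: the top-1 softmax router, the ReLU feed-forward experts, the additive combination of shared and routed outputs, and all names (`ffn`, `moe_with_shared_expert`, `router_w`) are hypothetical choices made here for brevity.

```python
import numpy as np


def ffn(x, w1, w2):
    """Two-layer feed-forward expert. ReLU is a simplification chosen for this sketch."""
    return np.maximum(x @ w1, 0.0) @ w2


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def moe_with_shared_expert(x, router_w, experts, shared):
    """Assumed formulation: output = shared_expert(x) + gate * routed_expert(x).

    x: (tokens, dim); router_w: (dim, n_experts);
    experts: list of (w1, w2) pairs; shared: one (w1, w2) pair.
    Returns the layer output and the top-1 expert index per token.
    """
    probs = softmax(x @ router_w)      # routing probabilities, (tokens, n_experts)
    choice = probs.argmax(axis=-1)     # top-1 routing (an assumption of this sketch)
    routed = np.zeros_like(x)
    for e, (w1, w2) in enumerate(experts):
        mask = choice == e
        if mask.any():
            gate = probs[mask, e][:, None]
            routed[mask] = gate * ffn(x[mask], w1, w2)
    # The shared expert processes every token regardless of the routing decision.
    return ffn(x, *shared) + routed, choice


def make_expert(rng, dim, hidden):
    """Random small-scale weights, for demonstration only."""
    return rng.normal(size=(dim, hidden)) * 0.1, rng.normal(size=(hidden, dim)) * 0.1


rng = np.random.default_rng(0)
dim, hidden, n_experts, tokens = 8, 16, 4, 10
experts = [make_expert(rng, dim, hidden) for _ in range(n_experts)]
shared = make_expert(rng, dim, hidden)
router_w = rng.normal(size=(dim, n_experts))
x = rng.normal(size=(tokens, dim))

y, choice = moe_with_shared_expert(x, router_w, experts, shared)
print(y.shape, choice)
```

In this sketch the shared path gives every token a usable output even when the router's decision is poor, which is one plausible reading of why a shared expert stabilizes training; the abstract itself states only that the shared expert captures common knowledge.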

Paper Structure

This paper contains 15 sections, 4 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Top-1 accuracy on ImageNet-1K. We compare ViMoE with other ViT architecture baselines. All models are evaluated at resolution $224\times224$.
  • Figure 2: Top-1 accuracy on ImageNet-1K under different values of $L$. We replace the FFNs with MoE layers in the last $L$ ViT blocks. $L=0$ represents the non-MoE DINOv2 baseline, and $L=12$ indicates that every block contains the MoE layer.
  • Figure 3: Training curves for various ViMoE configurations.
  • Figure 4: Routing heatmap of the $l$-th MoE layer, where $l=1$ represents the deepest (last) layer and $l=12$ denotes the shallowest (first) layer. The $x$-axis is the expert ID, and the $y$-axis is the class ID from ImageNet-1K. The label order in each figure is adjusted for better readability. Darker colors indicate a higher proportion of images from the corresponding class routed to the expert.
  • Figure 5: Routing heatmap of the $l$-th MoE layer for semantic segmentation on ADE20K, where $l=1$ represents the deepest (last) layer and $l=12$ denotes the shallowest (first) layer. Routing operates at the token level, where each image patch is allocated to an expert. The $x$-axis is the expert ID, and the $y$-axis is the class ID. The label order in each figure is adjusted for better readability. Darker colors indicate a higher proportion of image patches from the corresponding class routed to the expert.
  • ...and 5 more figures
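
The layer indexing in the captions of Figures 2, 4, and 5 and the routing heatmaps themselves can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the captions only: the helper names (`moe_block_indices`, `layer_to_block`, `routing_heatmap`), the zero-based block numbering, and the toy labels are assumptions introduced here, not taken from the paper's code.

```python
import numpy as np


def moe_block_indices(num_blocks, L):
    """Zero-based indices of the last L blocks, whose FFNs are replaced by MoE layers."""
    return list(range(num_blocks - L, num_blocks))


def layer_to_block(l, num_blocks):
    """Map the caption's index l (l=1 is the deepest layer) to a zero-based block index."""
    return num_blocks - l


def routing_heatmap(class_ids, expert_ids, n_classes, n_experts):
    """Row c, column e: fraction of samples of class c routed to expert e.

    A sample is an image for classification, or an image patch for
    token-level routing in segmentation.
    """
    counts = np.zeros((n_classes, n_experts))
    np.add.at(counts, (class_ids, expert_ids), 1)   # accumulate (class, expert) pairs
    totals = counts.sum(axis=1, keepdims=True)
    # Classes with no samples keep an all-zero row instead of dividing by zero.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


# Toy data: six samples, four classes (class 3 never appears), three experts.
class_ids = np.array([0, 0, 0, 1, 1, 2])
expert_ids = np.array([1, 1, 0, 2, 2, 0])
heat = routing_heatmap(class_ids, expert_ids, n_classes=4, n_experts=3)

print(moe_block_indices(12, 3))   # blocks that hold MoE layers when L=3
print(layer_to_block(1, 12))      # block index of the deepest MoE layer
print(heat)
```

A row concentrated on a single expert suggests that the layer routes that class consistently, whereas near-uniform rows suggest the layer is not specializing; this is the kind of pattern the heatmap captions describe.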