MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks

Xingkui Zhu; Yiran Guan; Dingkang Liang; Yuchao Chen; Yuliang Liu; Xiang Bai

TL;DR

MoE Jetpack lowers the adoption barrier for sparse mixture-of-experts (MoE) models in vision by reusing pre-trained dense checkpoints. It combines checkpoint recycling with a hyperspherical adaptive MoE (SpheroMoE) layer to initialize MoE models from dense weights and fine-tune them while keeping FLOPs equivalent. The approach introduces several recycling strategies, with Importance-Based Weight Sampling as the default, and a dual-path routing scheme that sends high-impact tokens to a few larger core experts and less critical tokens to many smaller universal experts. Experiments with ViT and ConvNeXt on a range of vision datasets show faster convergence and higher accuracy than training MoE models from scratch or using Soft MoE baselines. The authors plan to release code at https://github.com/Adlith/MoE-Jetpack.

Abstract

The sparsely activated mixture of experts (MoE) model presents a promising alternative to traditional densely activated (dense) models, enhancing both quality and computational efficiency. However, training MoE models from scratch demands extensive data and computational resources. Moreover, public repositories like timm mainly provide pre-trained dense checkpoints, lacking similar resources for MoE models, hindering their adoption. To bridge this gap, we introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models. MoE Jetpack incorporates two key techniques: (1) checkpoint recycling, which repurposes dense checkpoints as initial weights for MoE models, thereby accelerating convergence, enhancing accuracy, and alleviating the computational burden of pre-training; (2) hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture for better integration of dense checkpoints, enhancing fine-tuning performance. Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy when fine-tuning dense checkpoints into MoE models. Our code will be publicly available at https://github.com/Adlith/MoE-Jetpack.
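
The checkpoint recycling idea in the abstract can be illustrated with a minimal sketch. It assumes that each expert is built by sampling hidden neurons of one dense MLP with probability proportional to an importance score. The function names, the weight layout, and the importance scores are hypothetical and are not taken from the paper or its code; only hidden neurons are sampled here, and the model width is left unchanged.

```python
import random


def weighted_sample_without_replacement(weights, k, rng):
    """Draw k distinct indices, each draw proportional to the remaining weights."""
    if any(w < 0 for w in weights):
        raise ValueError("importance scores must be non-negative")
    candidates = list(range(len(weights)))
    # A tiny epsilon keeps zero-importance neurons drawable once the rest are used up.
    pool = [float(w) + 1e-12 for w in weights]
    chosen = []
    for _ in range(k):
        pos = rng.choices(range(len(candidates)), weights=pool, k=1)[0]
        chosen.append(candidates.pop(pos))
        pool.pop(pos)
    return sorted(chosen)


def recycle_mlp_into_experts(w_in, w_out, importance, num_experts, expert_hidden, seed=0):
    """Initialize experts from a dense MLP (illustrative layout, an assumption).

    w_in[j]  : input weights of hidden neuron j (length d_model)
    w_out[j] : output weights of hidden neuron j (length d_model)
    importance[j] : non-negative score of hidden neuron j
    """
    d_hidden = len(w_in)
    if not (len(w_out) == len(importance) == d_hidden):
        raise ValueError("w_in, w_out and importance must share the hidden dimension")
    if expert_hidden > d_hidden:
        raise ValueError("an expert cannot have more hidden neurons than the dense MLP")
    rng = random.Random(seed)
    experts = []
    for _ in range(num_experts):
        neurons = weighted_sample_without_replacement(importance, expert_hidden, rng)
        experts.append({
            "neurons": neurons,
            # Copy the matching rows so each expert owns its weights.
            "w_in": [list(w_in[j]) for j in neurons],
            "w_out": [list(w_out[j]) for j in neurons],
        })
    return experts


if __name__ == "__main__":
    dense_in = [[0.1 * j] * 4 for j in range(8)]
    dense_out = [[-0.1 * j] * 4 for j in range(8)]
    scores = [1.0, 0.0, 2.0, 5.0, 0.5, 3.0, 0.0, 1.0]
    for expert in recycle_mlp_into_experts(dense_in, dense_out, scores, 3, 4):
        print(expert["neurons"])
```

Sampling without replacement inside an expert avoids duplicated neurons, while different experts may still share neurons; whether the paper makes the same choice is not stated in this summary.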

Paper Structure (17 sections, 11 equations, 7 figures, 8 tables)

Introduction
Background
MoE Jetpack
Checkpoint Recycling
SpheroMoE Layer
Experiments
Experimental Setups
Main Results
Ablations
Analysis
Related Work
Conclusion
Detailed Model Configurations
Experiment Settings and Time Costs
Implementation of SpheroMoE Layer
...and 2 more sections

Figures (7)

Figure 1: (a) Our MoE Jetpack converts pre-trained dense models into MoE models, enhancing convergence and performance while maintaining equivalent FLOPs. Here, Exp. represents individual experts, $E$ denotes the number of experts, and $L$ indicates the total number of layers. (b) Performance comparison of ViT trained from scratch, pre-trained ViT, Soft MoE trained from scratch, and MoE Jetpack across various datasets. MoE Jetpack shows significant performance improvements.
Figure 2: (a) Checkpoint Recycling selects neurons and channels from the MLP of pre-trained dense checkpoints using weight sampling methods. This process transforms pre-trained knowledge into multiple experts of any size for initializing MoE models. (b) The SpheroMoE layer uses cross-attention to adaptively dispatch input tokens to expert slots. It starts with a randomly initialized query and uses keys and values derived and normalized from the input. The similarity logits between the query and key are calculated in a hyperspherical space, stabilizing the random query. The outputs from the experts are then combined back into the input using the generated similarity logits.
Figure 3: The Adaptive Dual-path MoE structure enhances the SpheroMoE Router by adapting it into a dual-branch system, designed to optimize computational efficiency and model performance. This configuration directs high-impact tokens to a core path with fewer but larger experts, while routing less critical tokens to a universal path equipped with a greater number of smaller experts.
Figure 4: This chart shows CIFAR-100 accuracy across different ratios of core (dark) to universal (light) experts, highlighting optimal performance at a 1/3 core ratio.
Figure 5: Comparison of convergence speeds using MoE Jetpack versus training from scratch on ImageNet (left) and CIFAR-100 (right). MoE Jetpack achieves target accuracies significantly faster, demonstrating a 2x speed increase on ImageNet and an 8x increase on CIFAR-100.
...and 2 more figures
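
The routing described in the Figure 2(b) caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the softmax axes (over tokens for dispatch, over slots for combine), the `temperature` parameter, the use of raw tokens as values, and the function name `spherical_soft_route` are all hypothetical. The only element taken from the caption is that query-key similarity is computed between L2-normalized vectors, so each logit is a bounded cosine similarity.

```python
import math
import random


def l2_normalize(vec, eps=1e-12):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def spherical_soft_route(tokens, queries, experts, temperature=0.1):
    """tokens: n vectors of size d; queries: s slot queries of size d;
    experts: s callables mapping a size-d vector to a size-d vector."""
    n, s, d = len(tokens), len(queries), len(tokens[0])
    keys = [l2_normalize(t) for t in tokens]
    unit_queries = [l2_normalize(q) for q in queries]
    # logits[i][j] is the cosine similarity of token i and slot j, scaled by 1/temperature.
    logits = [[sum(a * b for a, b in zip(k, q)) / temperature for q in unit_queries]
              for k in keys]
    # Dispatch: each slot takes a convex combination of all tokens.
    dispatch = [softmax([logits[i][j] for i in range(n)]) for j in range(s)]
    slot_inputs = [[sum(dispatch[j][i] * tokens[i][c] for i in range(n)) for c in range(d)]
                   for j in range(s)]
    slot_outputs = [experts[j](slot_inputs[j]) for j in range(s)]
    # Combine: each token takes a convex combination of all slot outputs.
    combine = [softmax(logits[i]) for i in range(n)]
    return [[sum(combine[i][j] * slot_outputs[j][c] for j in range(s)) for c in range(d)]
            for i in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    toks = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
    slot_queries = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]
    toy_experts = [lambda v: [2.0 * x for x in v], lambda v: [x + 1.0 for x in v]]
    for row in spherical_soft_route(toks, slot_queries, toy_experts):
        print([round(x, 3) for x in row])
```

Because both queries and keys are normalized, the logits stay within a fixed range regardless of how the random queries are scaled, which is one plausible reading of the caption's remark that the hyperspherical space stabilizes the random query.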
