MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
Xingkui Zhu, Yiran Guan, Dingkang Liang, Yuchao Chen, Yuliang Liu, Xiang Bai
TL;DR
MoE Jetpack lowers the adoption barrier for sparse mixture-of-experts (MoE) models in vision by reusing pre-trained dense checkpoints. It combines checkpoint recycling with a hyperspherical adaptive MoE (SpheroMoE) layer to initialize MoE models from dense weights and fine-tune them without increasing FLOPs. The approach introduces several recycling strategies, with Importance-Based Weight Sampling as the default, and a dual-path routing scheme that divides experts into core and universal groups to balance efficiency and accuracy. Experiments with ViT and ConvNeXt on diverse vision datasets show faster convergence and higher accuracy than training MoE models from scratch or using Soft MoE baselines. The authors also release code to support reproducibility and further research.
Abstract
The sparsely activated mixture of experts (MoE) model presents a promising alternative to traditional densely activated (dense) models, enhancing both quality and computational efficiency. However, training MoE models from scratch demands extensive data and computational resources. Moreover, public repositories like timm mainly provide pre-trained dense checkpoints and lack comparable resources for MoE models, which hinders their adoption. To bridge this gap, we introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models. MoE Jetpack incorporates two key techniques: (1) checkpoint recycling, which repurposes dense checkpoints as initial weights for MoE models, thereby accelerating convergence, enhancing accuracy, and alleviating the computational burden of pre-training; (2) a hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture for better integration of dense checkpoints, enhancing fine-tuning performance. Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy when fine-tuning dense checkpoints into MoE models. Our code will be publicly available at https://github.com/Adlith/MoE-Jetpack.
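
To make the idea of checkpoint recycling concrete, the following is a minimal sketch of how a dense MLP block might be turned into initial weights for several smaller experts. It is not the authors' implementation. The importance score (product of a hidden unit's input- and output-weight norms), the function names, and the toy dimensions are all assumptions chosen for illustration. Each expert samples a subset of the dense block's hidden units, with probability weighted by that score, and copies the matching weight slices.

```python
import math
import random


def unit_importance(w_in, w_out):
    """Score each hidden unit (assumed proxy: input-norm times output-norm)."""
    scores = []
    for j, row in enumerate(w_in):
        in_norm = math.sqrt(sum(v * v for v in row))
        out_norm = math.sqrt(sum(r[j] * r[j] for r in w_out))
        scores.append(in_norm * out_norm)
    return scores


def sample_units(scores, k, rng):
    """Weighted sampling without replacement using Efraimidis-Spirakis keys."""
    eps = 1e-12  # keeps the exponent finite when a score is exactly zero
    keys = [(rng.random() ** (1.0 / (s + eps)), j) for j, s in enumerate(scores)]
    keys.sort(reverse=True)
    return sorted(j for _, j in keys[:k])


def recycle_dense_mlp(w_in, w_out, num_experts, expert_hidden, seed=0):
    """Build expert weights by slicing sampled hidden units out of a dense MLP.

    w_in has shape (hidden, d_in) and w_out has shape (d_out, hidden),
    both given as nested lists.
    """
    rng = random.Random(seed)
    scores = unit_importance(w_in, w_out)
    experts = []
    for _ in range(num_experts):
        idx = sample_units(scores, expert_hidden, rng)
        experts.append({
            "units": idx,
            "w_in": [list(w_in[j]) for j in idx],                # rows of w_in
            "w_out": [[row[j] for j in idx] for row in w_out],   # columns of w_out
        })
    return experts


# Toy dense block: d_in = d_out = 4, hidden = 8 (hypothetical sizes).
toy_rng = random.Random(42)
dense_w_in = [[toy_rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
dense_w_out = [[toy_rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]

# Four experts, each keeping 2 of the 8 hidden units.
experts = recycle_dense_mlp(dense_w_in, dense_w_out, num_experts=4, expert_hidden=2)
```

In this sketch each expert is narrower than the dense block it was sampled from, which is one way to keep per-token compute from growing when several experts replace a single MLP. The actual sampling criterion and expert sizes used by MoE Jetpack are defined in the paper and the released code.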
