CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling

Xinze Wang, Chen Chen, Yinfei Yang, Hong-You Chen, Bowen Zhang, Aditya Pal, Xiangxin Zhu, Xianzhi Du

TL;DR

This paper tackles the high cost of scaling CLIP with Mixture-of-Experts (MoE) by introducing CLIP-UP, a simple sparse upcycling method that converts a pre-trained dense CLIP into a sparse MoE with a separated backbone and lightweight auxiliary losses. CLIP-UP initializes the MoE layers from the dense checkpoint and stabilizes training with a reduced learning rate plus load balance and router z-losses, avoiding LIMOE’s entropy losses. The authors show that CLIP-UP reduces training cost and inference FLOPs while improving retrieval: the sparse B/16 model outperforms its dense counterpart by 7.2% and 6.6% on COCO and Flickr30K text-to-image Recall@1, and it surpasses the larger dense L/14 model while using only 30% of the inference FLOPs. The recipe scales from B/32 to L/14, with improvements across scales and a trade-off between quality and compute controlled by expert capacity. The approach broadens the applicability of MoE in multimodal pretraining by enabling cost-efficient upcycling that preserves or enhances retrieval quality.
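
The two auxiliary losses named above can be illustrated with a minimal sketch. It assumes the commonly used formulations (a Switch-Transformer-style load balance loss and an ST-MoE-style router z-loss) and top-1 routing; the paper's exact definitions, routing scheme, and loss weights are not given in this summary, and the function names below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balance_loss(router_logits):
    """Assumed form: E * sum_i f_i * P_i, with top-1 routing.

    f_i = fraction of tokens whose top-1 expert is i
    P_i = mean router probability assigned to expert i
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    frac_tokens = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for row in router_logits:
        probs = softmax(row)
        top1 = max(range(num_experts), key=lambda i: probs[i])
        frac_tokens[top1] += 1.0 / num_tokens
        for i in range(num_experts):
            mean_prob[i] += probs[i] / num_tokens
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


def router_z_loss(router_logits):
    """Assumed form: mean over tokens of (log-sum-exp of the logits)^2."""
    total = 0.0
    for row in router_logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return total / len(router_logits)


# Two tokens, two experts: balanced routing vs. all tokens on expert 0.
balanced = [[10.0, 0.0], [0.0, 10.0]]
collapsed = [[10.0, 0.0], [10.0, 0.0]]
print(load_balance_loss(balanced), load_balance_loss(collapsed))
print(router_z_loss(balanced))
```

Under these assumed forms, the load balance term is smallest when tokens are spread evenly across experts, and the z-loss penalizes large router logits, which is the stabilizing role the summary attributes to them.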

Abstract

Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various settings and auxiliary losses, we demonstrate that CLIP-UP significantly reduces training complexity and cost. Remarkably, our sparse CLIP B/16 model, trained with CLIP-UP, outperforms its dense counterpart by 7.2% and 6.6% on COCO and Flickr30k text-to-image Recall@1 benchmarks respectively. It even surpasses the larger CLIP L/14 model on this task while using only 30% of the inference FLOPs. We further demonstrate the generalizability of our training recipe across different scales, establishing sparse upcycling as a practical and scalable approach for building efficient, high-performance CLIP models.

Paper Structure

This paper contains 29 sections, 5 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Our proposed MoE CLIP pre-training recipe. We highlight key factors for efficient training, including backbone sharing, training from scratch vs. sparse upcycling, and auxiliary losses. A detailed analysis is provided in Section \ref{comparison-methodology} and Section \ref{limoe-loss}.
  • Figure 2: CLIP-UP overview with sparse upcycling initialization. Selected MLP layers are replaced with MoE layers, initialized from the dense checkpoint, while routers are randomly initialized.
  • Figure 3: Impact of LIMOE auxiliary loss under different training setups. Adding LIMOE loss sometimes causes instability, especially with unshared backbones, while our upcycling recipe remains more robust.
  • Figure 4: Performance vs. training EFLOPs for CLIP-UP and a sparse-from-scratch model on CLIP B/16.
  • Figure 5: CLIP-UP with MoE upcycling for only the text encoder, image encoder, or both. We observe upcycling both the image and text encoders into MoE generally helps, especially for retrieval tasks.
  • ...and 6 more figures
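
The initialization described in the Figure 2 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the dense MLP is represented as a plain dictionary of weight lists, and the function name `upcycle_mlp_to_moe`, the router shape (`d_model` x `num_experts`), and the router initialization scale `router_std` are all hypothetical choices made for the example.

```python
import copy
import random


def upcycle_mlp_to_moe(dense_mlp, num_experts, d_model, seed=0, router_std=0.02):
    """Build one MoE layer from one dense MLP (sparse upcycling sketch).

    Every expert starts as an independent copy of the dense MLP weights;
    the router has no dense counterpart, so it is randomly initialized.
    """
    rng = random.Random(seed)
    # Each expert is a deep copy so experts can diverge during training.
    experts = [copy.deepcopy(dense_mlp) for _ in range(num_experts)]
    # Router maps a d_model-dimensional token to one logit per expert.
    router = [
        [rng.gauss(0.0, router_std) for _ in range(num_experts)]
        for _ in range(d_model)
    ]
    return {"experts": experts, "router": router}


# Toy dense MLP with d_model = 2 and hidden size 2.
dense_mlp = {
    "w_in": [[0.1, -0.2], [0.3, 0.4]],
    "w_out": [[0.5, 0.6], [0.7, 0.8]],
}
moe_layer = upcycle_mlp_to_moe(dense_mlp, num_experts=4, d_model=2)
print(len(moe_layer["experts"]), len(moe_layer["router"]))
```

Because all experts begin identical to the dense MLP, the upcycled layer starts from the dense checkpoint's learned weights rather than from scratch, which is the source of the training-cost savings the paper reports.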