CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

Jihai Zhang, Xiaoye Qu, Tong Zhu, Yu Cheng

TL;DR

CLIP struggles to encode the full feature space, motivating diversification of CLIP representations. The authors propose Diversified Multiplet Upcycling (DMU) to generate multiple specialized CLIP models via Multistage Contrastive Learning (MCL) and assemble them into a sparse Mixture of Experts (CLIP-MoE). Using a small, high-quality image-caption dataset, the approach yields substantial gains in zero-shot retrieval and improves CLIP as a vision encoder for Multimodal Large Language Models, with minimal additional training cost. The results demonstrate that MCL-derived experts capture complementary information and that MoE routing leverages this diversity to enhance performance while controlling computation.

Abstract

Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence. However, recent studies discovered that CLIP can only encode one aspect of the feature space, leading to substantial information loss and indistinctive features. To mitigate this issue, this paper introduces a novel strategy that fine-tunes a series of complementary CLIP models and transforms them into a CLIP-MoE. Specifically, we propose a model-agnostic Diversified Multiplet Upcycling (DMU) framework for CLIP. Instead of training multiple CLIP models from scratch, DMU leverages a pre-trained CLIP and fine-tunes it into a diverse set with highly cost-effective multistage contrastive learning, thus capturing distinct feature subspaces efficiently. To fully exploit these fine-tuned models while minimizing computational overhead, we transform them into a CLIP-MoE, which dynamically activates a subset of CLIP experts, achieving an effective balance between model capacity and computational cost. Comprehensive experiments demonstrate the superior performance of CLIP-MoE across various zero-shot retrieval, zero-shot image classification tasks, and downstream Multimodal Large Language Model (MLLM) benchmarks when used as a vision encoder.

Paper Structure (16 sections, 5 equations, 3 figures, 6 tables)


Figures (3)

  • Figure 1: Overview of Diversified Multiplet Upcycling: Our approach involves three key steps. (a) Fine-tuning the base CLIP model using the MCL framework while freezing all parameters except for the FFN layers. This process yields a new set of FFN layers at each stage of MCL. (b) Using the obtained FFN layers as experts to initialize a CLIP-MoE. (c) Continuously fine-tuning the CLIP-MoE using both contrastive learning loss and a router balancing loss to optimize the routers. The terms ‘color’, ‘shape’, and ‘texture’ are metaphorical representations of abstract features.
  • Figure 2: Example cases comparing the performance of CLIP-MoE and OpenAI CLIP on the MMVP-VLM Benchmark, illustrating differences in their ability to capture fine-grained semantic information.
  • Figure 3: Proportion of tokens assigned to each expert on the COCO and ImageNet validation datasets. An expert is counted for a token if the router selects it as either the first or the second choice.
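
The routing described in the Figure 1 and Figure 3 captions can be illustrated with a minimal sketch. It assumes top-2 routing with renormalized softmax weights, which is a common MoE choice and is suggested by the "first or second choice" wording, but the paper's exact router is not given here. The function names (top2_route, moe_ffn, expert_usage) and the toy scaling "experts" are hypothetical stand-ins for the FFN layers obtained from MCL, and the router balancing loss is not sketched.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts.

    Assumption: weights are the softmax probabilities of the two selected
    experts, renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


def moe_ffn(token, router_logits, experts):
    """Weighted sum of the two selected experts' outputs for one token."""
    out = [0.0] * len(token)
    for idx, weight in top2_route(router_logits):
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


def expert_usage(all_router_logits, num_experts):
    """Fraction of tokens for which each expert is a first or second choice."""
    counts = Counter()
    for logits in all_router_logits:
        for idx, _ in top2_route(logits):
            counts[idx] += 1
    n = len(all_router_logits)
    return [counts[i] / n for i in range(num_experts)]


# Toy experts: simple scalings standing in for the per-stage FFN layers.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

# Random router logits for 1000 hypothetical tokens.
random.seed(0)
logits = [[random.gauss(0, 1) for _ in experts] for _ in range(1000)]
usage = expert_usage(logits, len(experts))
```

Because every token selects exactly two experts, the per-expert proportions computed this way sum to 2 rather than 1. Only the two selected experts are evaluated for each token, which is how a sparse MoE adds capacity without a proportional increase in per-token computation.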