Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters

Jiazuo Yu; Yunzhi Zhuge; Lu Zhang; Ping Hu; Dong Wang; Huchuan Lu; You He

TL;DR

Continual learning for large vision-language models suffers from catastrophic forgetting and high computational cost. The authors propose a parameter-efficient solution that freezes CLIP and adds incremental Mixture-of-Experts Adapters (MoE-Adapters) with task-specific routers, together with a Distribution Discriminative Auto-Selector (DDAS) that routes seen inputs to the adapters and unseen inputs to the original CLIP. For task $t$, the router $\mathcal{R}^t$ maps the task-specific [CLS] token $\mathbf{c}^t$ to gating weights $W^t = \mathrm{Softmax}(\mathrm{TopK}(\mathcal{R}^t(\mathbf{c}^t)))$, and the adapter output is the weighted sum over the $N_E$ experts, $\mathbf{y}^t = \sum_{i=1}^{N_E} W_i^t \mathcal{E}_i(\mathbf{x}^t)$. DDAS uses per-task autoencoders with a threshold $\mathit{Thres}$ to discriminate data distributions, which preserves zero-shot transfer while maintaining long-term memorization. The paper reports strong MTIL and CIL results with a roughly 60% reduction in parameter training burden. Overall, the method enables scalable, zero-shot-capable continual learning for vision-language foundation models, with improved performance and efficiency across multi-domain and few-shot scenarios.
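To make the gating rule in the TL;DR concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a linear router (one weight row per expert), toy experts given as plain callables that preserve the input dimension, and $k=2$ active experts; the names `topk_mask`, `router_logits`, and `moe_forward` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def topk_mask(logits: Sequence[float], k: int) -> List[float]:
    """Keep the k largest logits and set the rest to -inf (TopK in the formula)."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else float("-inf") for i, v in enumerate(logits)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax; masked (-inf) entries receive weight 0."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_logits(router_weights: Sequence[Sequence[float]],
                  cls_token: Sequence[float]) -> List[float]:
    """Assumed linear router R^t: one dot product per expert with the [CLS] token."""
    return [sum(w * c for w, c in zip(row, cls_token)) for row in router_weights]


def moe_forward(x: Vector,
                cls_token: Vector,
                router_weights: Sequence[Sequence[float]],
                experts: Sequence[Callable[[Vector], Vector]],
                k: int = 2) -> Tuple[Vector, List[float]]:
    """Compute W = Softmax(TopK(R(c))) and y = sum_i W_i * E_i(x)."""
    weights = softmax(topk_mask(router_logits(router_weights, cls_token), k))
    y = [0.0] * len(x)
    for w, expert in zip(weights, experts):
        if w == 0.0:
            continue  # inactive experts are skipped entirely
        y = [acc + w * out for acc, out in zip(y, expert(x))]
    return y, weights


# Toy usage: three experts that simply scale their input by 1, 2, and 3.
toy_experts = [lambda v, s=s: [s * a for a in v] for s in (1.0, 2.0, 3.0)]
toy_router = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
y, w = moe_forward([1.0, -1.0], [2.0, 1.0], toy_router, toy_experts, k=2)
print(w, y)
```

Masking with $-\infty$ before the softmax means the selected experts' weights sum to one while the remaining experts contribute nothing, which matches the order of operations in $\mathrm{Softmax}(\mathrm{TopK}(\cdot))$.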

Abstract

Continual learning can empower vision-language models to continuously acquire new knowledge, without the need for access to the entire historical dataset. However, mitigating the performance degradation in large-scale models is non-trivial due to (i) parameter shifts throughout lifelong learning and (ii) significant computational burdens associated with full-model tuning. In this work, we present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models. Our approach dynamically expands a pre-trained CLIP model by integrating Mixture-of-Experts (MoE) adapters in response to new tasks. To preserve the zero-shot recognition capability of vision-language models, we further introduce a Distribution Discriminative Auto-Selector (DDAS) that automatically routes in-distribution and out-of-distribution inputs to the MoE Adapter and the original CLIP, respectively. Through extensive experiments across various settings, our proposed method consistently outperforms previous state-of-the-art approaches while reducing parameter training burdens by 60%. Our code is available at https://github.com/JiazuoYu/MoE-Adapters4CL
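The DDAS routing idea described in the abstract can be illustrated with a small sketch. This is a hedged illustration only: it assumes the selector scores an input by each task autoencoder's mean-squared reconstruction error, picks the best-matching task, and falls back to the frozen CLIP when even the best error exceeds the threshold. The exact decision rule, the function names `ddas_route` and `make_toy_autoencoder`, and the toy "autoencoders" are assumptions, not details taken from the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean-squared error between an input and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def ddas_route(feature: Vector,
               autoencoders: Sequence[Callable[[Vector], Vector]],
               thres: float) -> Tuple[Optional[int], List[float]]:
    """Return (task_index, scores).

    task_index is the task whose autoencoder reconstructs the input best,
    or None when no score is within `thres` (assumed to mean: treat the
    input as unseen and use the original frozen CLIP).
    """
    scores = [mse(feature, ae(feature)) for ae in autoencoders]
    best = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] <= thres:
        return best, scores
    return None, scores


def make_toy_autoencoder(center: Vector) -> Callable[[Vector], Vector]:
    """Stand-in for a trained autoencoder (hypothetical).

    It pulls the input halfway toward a task-specific center, so the
    reconstruction error grows with distance from that center.
    """
    def reconstruct(x: Vector) -> Vector:
        return [c + 0.5 * (v - c) for v, c in zip(x, center)]
    return reconstruct


# Toy usage: two "tasks" centered at (0, 0) and (10, 10).
toy_aes = [make_toy_autoencoder([0.0, 0.0]), make_toy_autoencoder([10.0, 10.0])]
seen_task, _ = ddas_route([9.5, 10.5], toy_aes, thres=1.0)     # close to task 1
unseen_task, _ = ddas_route([5.0, -20.0], toy_aes, thres=1.0)  # far from both
print(seen_task, unseen_task)
```

In this sketch a returned index would select that task's router and adapters, while `None` would send the input through the unmodified CLIP so that zero-shot behavior is untouched.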

Paper Structure (12 sections, 3 equations, 7 figures, 12 tables)

This paper contains 12 sections, 3 equations, 7 figures, 12 tables.

Introduction
Related Works
Methodology
Continual Learning
Framework Overview
Incremental Mixture-of-Experts Adapters
Distribution Discriminative Auto-Selector
Experiments
Experimental Setting
Comparison with State-of-the-art Methods
Ablation Study
Discussion

Figures (7)

Figure 1: Comparison of popular architectures for addressing CL. (a) Traditional dynamic expansion-based CL cannot distinguish unseen data. (b) Zero-shot CL \cite{zheng2023preventing} suffers from significant computational burdens. (c) The proposed MoE-Adapters and DDAS collaborate to form a parameter-efficient, zero-shot CL framework.
Figure 2: Overall framework of the proposed method. (a) At the training stage, CLIP's image and text encoders $(\mathcal{F}_I,\mathcal{F}_T)$ take input samples from Task $t$. Each transformer block contains an MoE-Adapters module whose input is the tokens $\mathbf{x}^t$ from MHSA. The router takes the task-specific [CLS] token $\mathbf{c}^t$ as input and produces the experts' weights $W_i^t$ and $W_j^t$, which are used to combine the experts' outputs. DDAS is trained using only images via the MSE loss defined by Eq. \ref{eq:chooseid1}. (b) At the inference stage, the proposed DDAS determines the data distribution of the task-agnostic images by comparing the distributions $\{d^t\}_{t=1}^T$ obtained from each autoencoder. It then automatically assigns the test data to either the MoE-Adapters (seen data) or the original CLIP (unseen data) for prediction.
Figure 3: The three distinct combinations among activated experts: (a) both trained, (b) one trainable and one frozen, (c) both frozen, with only the router trainable.
Figure 4: Analysis of the number of experts at different training iterations. The results correspond to "Ours" and "Ours$\dagger$" in Table \ref{tab:compareAvg_Last_Transfer}.
Figure 5: t-SNE of DDAS's output for each task on full-shot and few-shot MTIL. The task names for $id = 1, \dots, 11$ match the datasets listed from left to right in Table \ref{tab:compareAvg_Last_Transfer}.
...and 2 more figures
