Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, You He
TL;DR
Continual learning for large vision-language models suffers from catastrophic forgetting and high computational cost. The authors propose a parameter-efficient solution: they freeze CLIP and add incremental Mixture-of-Experts Adapters (MoE-Adapters) with task-specific routers, together with a Distribution Discriminative Auto-Selector (DDAS) that routes inputs from seen distributions to the adapters and inputs from unseen distributions to the original CLIP. For task $t$, the router $\mathcal{R}^t$ produces gating weights $W^t = \mathrm{Softmax}(\mathrm{TopK}(\mathcal{R}^t(\mathbf{c}^t)))$, and the adapter output is the weighted sum over the $N_E$ experts, $\mathbf{y}^t = \sum_{i=1}^{N_E} W_i^t \mathcal{E}_i(\mathbf{x}^t)$. DDAS uses per-task autoencoders with a threshold $Thres$ to discriminate data distributions, which preserves zero-shot transfer while maintaining long-term memorization. The method reports strong results in the MTIL and CIL settings with roughly a 60% reduction in parameter-training burden, covering both multi-domain and few-shot scenarios.
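The gating and mixing formulas above can be illustrated with a small sketch. This is not the authors' implementation: it assumes the router logits $\mathcal{R}^t(\mathbf{c}^t)$ are already computed and passed in as a plain list, that each expert preserves the input dimension, and that TopK is realized by taking the softmax over only the k largest logits (all other weights are zero). The names `topk_softmax`, `moe_output`, and the toy experts are hypothetical.

```python
import math

def topk_softmax(logits, k):
    """Softmax restricted to the k largest logits; all other weights are 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]

def moe_output(x, experts, router_logits, k=2):
    """Compute y = sum_i W_i * E_i(x) using only the selected experts."""
    weights = topk_softmax(router_logits, k)
    y = [0.0] * len(x)  # assumption: experts keep the input dimension
    for w, expert in zip(weights, experts):
        if w == 0.0:
            continue  # unselected experts are never evaluated
        out = expert(x)
        y = [acc + w * val for acc, val in zip(y, out)]
    return y, weights

# Toy experts (hypothetical): each one just scales its input.
def make_scaling_expert(scale):
    return lambda x: [scale * v for v in x]

experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0)]
y, weights = moe_output([1.0, 2.0], experts, router_logits=[1.0, 3.0, 2.0], k=2)
```

With these toy logits the first expert falls outside the top two, so its weight is zero and it is skipped; the two remaining weights sum to one.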
Abstract
Continual learning can empower vision-language models to continuously acquire new knowledge, without the need for access to the entire historical dataset. However, mitigating the performance degradation in large-scale models is non-trivial due to (i) parameter shifts throughout lifelong learning and (ii) significant computational burdens associated with full-model tuning. In this work, we present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models. Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters in response to new tasks. To preserve the zero-shot recognition capability of vision-language models, we further introduce a Distribution Discriminative Auto-Selector (DDAS) that automatically routes in-distribution and out-of-distribution inputs to the MoE Adapter and the original CLIP, respectively. Through extensive experiments across various settings, our proposed method consistently outperforms previous state-of-the-art approaches while concurrently reducing parameter training burdens by 60%. Our code is available at https://github.com/JiazuoYu/MoE-Adapters4CL
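The routing idea behind DDAS can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the repository: each task is assumed to have an autoencoder exposed as a callable that returns a reconstruction, the score is assumed to be the mean squared reconstruction error, and the input is sent to the adapter of the task with the lowest error only if that error is below a threshold. The names `reconstruction_error`, `select_route`, and the toy autoencoders are hypothetical.

```python
def reconstruction_error(autoencoder, feature):
    """Mean squared error between a feature and its reconstruction."""
    recon = autoencoder(feature)
    return sum((a - b) ** 2 for a, b in zip(feature, recon)) / len(feature)

def select_route(feature, autoencoders, thres):
    """Return ("adapter", task_index) for in-distribution inputs,
    or ("clip", None) when no task autoencoder reconstructs the input well."""
    errors = [reconstruction_error(ae, feature) for ae in autoencoders]
    best = min(range(len(errors)), key=errors.__getitem__)
    if errors[best] < thres:
        return ("adapter", best)
    return ("clip", None)

# Toy stand-in (hypothetical): an "autoencoder" that always reconstructs its
# task prototype, so the error grows with the distance from that prototype.
def make_toy_autoencoder(prototype):
    return lambda feature: list(prototype)

autoencoders = [make_toy_autoencoder([0.0, 0.0]), make_toy_autoencoder([5.0, 5.0])]
seen = select_route([4.9, 5.1], autoencoders, thres=0.5)     # close to task 1
unseen = select_route([2.5, -3.0], autoencoders, thres=0.5)  # far from both tasks
```

In this toy setup the first feature lies close to the second prototype and is routed to that task's adapter, while the second feature is far from both prototypes and falls back to the frozen CLIP path.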
