Don't Stop Learning: Towards Continual Learning for the CLIP Model

Yuxuan Ding, Lingqiao Liu, Chunna Tian, Jingyuan Yang, Haoxuan Ding

TL;DR

This work investigates the continual learning problem for the CLIP model, showing that fine-tuning on target concepts can substantially degrade zero-shot and image-text matching capabilities. It introduces a systematic evaluation protocol with OST and MST settings and develops four extensions of existing continual learning methods (LwF, GeoDL, IMM, RKR) adapted to CLIP, along with a novel VR-LwF approach that uses replayed vocabulary as pseudo-classes to mitigate forgetting. VR-LwF delivers the strongest overall performance by preserving prior capabilities while still benefiting from fine-tuning, highlighting a practical path for incremental improvements of CLIP. The study also clarifies that CL-CLIP presents unique challenges distinct from traditional continual learning, motivating future research on robust multimodal continual updating.

Abstract

The Contrastive Language-Image Pre-training (CLIP) model is a recently proposed large-scale pre-trained model that is attracting increasing attention in the computer vision community. Benefiting from its gigantic image-text training set, the CLIP model has learned outstanding capabilities in zero-shot learning and image-text matching. To boost the recognition performance of CLIP on some target visual concepts, it is often desirable to further update the CLIP model by fine-tuning it on extra training data for some classes of interest. This operation, however, raises an important concern: will the update hurt the zero-shot learning or image-text matching capability of CLIP, i.e., cause catastrophic forgetting? If so, could existing continual learning algorithms be adapted to alleviate the risk of catastrophic forgetting? To answer these questions, this work conducts a systematic study of the continual learning issue of the CLIP model. We construct evaluation protocols to measure the impact of fine-tuning updates and explore different ways to upgrade existing continual learning methods to mitigate the forgetting issue of the CLIP model. Our study reveals the particular challenges of the CLIP continual learning problem and lays a foundation for further research. Moreover, we propose a new algorithm, dubbed Learning without Forgetting via Replayed Vocabulary (VR-LwF), which proves effective at alleviating the forgetting issue of the CLIP model.

Paper Structure

This paper contains 34 sections, 7 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: The workflow of the CLIP model. (a) CLIP is trained on a huge dataset of noisy image-text pairs. (b) At zero-shot inference, the input of the text encoder is retrieval captions or text prompts. The encoders generate image and text embeddings, and the similarities between them are computed. The most similar match is taken as the prediction.
  • Figure 2: Overview of our proposed evaluation protocols. The CLIP model is fine-tuned on a constructed object classification dataset. After fine-tuning, the model is evaluated on the updated task, zero-shot classification, and image-text retrieval.
  • Figure 3: The framework of the proposed VR-LwF method. The pseudo vocabulary sequences are fed into both the previous and the current text encoders. The logits of the replayed classes are enforced. The model is trained with a combination of the cross-entropy classification loss and the replaying distillation loss. The grey part of the figure indicates the previous model and its outputs.
  • Figure 4: Accuracy (%) of methods in each MST session. The first row of charts shows accuracy on the updated task (UT-Acc); the second row shows ZS-Acc. From left to right, the columns show the WM, IO, and TO variants. The Original CLIP and Joint-FT are the lower and upper bounds of UT-Acc, respectively.
  • Figure 5: Comparison of Flickr30k and COCO retrieval performance. The horizontal axis shows TR@1 (%) and the vertical axis shows IR@1 (%). The top/bottom chart shows OST/MST results. The figure shows that the Flickr and COCO results follow the same distribution, which means the evaluation is transferable.
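
The zero-shot inference step described in the Figure 1 caption (embed the image and the candidate texts, compute similarities, take the most similar match) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `l2_normalize` and `zero_shot_predict` are hypothetical, the use of cosine similarity is an assumption, and the toy vectors stand in for the outputs of CLIP's image and text encoders, which are not run here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_emb, text_embs):
    """Return (index of the most similar text embedding, list of similarities).

    image_emb: one embedding, assumed to come from an image encoder.
    text_embs: one embedding per candidate prompt or caption,
               assumed to come from a text encoder.
    """
    img = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for txt in text_embs
    ]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims


# Toy stand-ins for encoder outputs (hypothetical values, not real CLIP features).
text_embs = [
    [1.0, 0.0, 0.0],  # e.g. embedding of a prompt for class 0
    [0.0, 1.0, 0.0],  # class 1
    [0.0, 0.0, 1.0],  # class 2
]
image_emb = [0.1, 0.9, 0.2]

pred, sims = zero_shot_predict(image_emb, text_embs)
```

With these toy vectors the image embedding points mostly along the second axis, so the second prompt is selected. The same similarity scores can be ranked rather than arg-maxed to illustrate image-text retrieval.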