LVP-CLIP: Revisiting CLIP for Continual Learning with Label Vector Pool

Yue Ma, Huantao Ren, Boyu Wang, Jingang Jin, Senem Velipasalar, Qinru Qiu

TL;DR

This paper introduces Label Vector Pool (LVP), a new framework for continual learning that replaces text-based class labels with reference image embeddings stored in a pool, enabling CLIP-based models to learn new tasks without forgetting. It presents three practical variants—LVP-I, LVP-IT, and LVP-C—each balancing memory, computation, and accuracy by using image embeddings alone, a text–image blend, or a learned classifier on LVP vectors. Across class-, domain-, and cross-task incremental settings (including a large CTIL benchmark with 595 classes), LVP-CLIP demonstrates superior performance and significantly reduced forgetting, while offering low memory overhead and high parallelizability. The work shows that high-dimensional embedding spaces and modality-agnostic labeling can yield substantial gains in continual learning, with practical benefits for scalability and privacy. Overall, LVP-CLIP provides a robust, efficient alternative to prompt-based and rehearsal-heavy continual learning approaches, accelerating real-world deployment.

Abstract

Continual learning aims to update a model so that it can learn new tasks sequentially without forgetting previously acquired knowledge. Recent continual learning approaches often leverage the vision-language model CLIP for its high-dimensional feature space and cross-modality feature matching. Traditional CLIP-based classification methods identify the most similar text label for a test image by comparing their embeddings. However, these methods are sensitive to the quality of the text phrases and are less effective for classes that lack meaningful text labels. In this work, we rethink CLIP-based continual learning and introduce the concept of the Label Vector Pool (LVP). LVP replaces text labels with training images as similarity references, eliminating the need for ideal text descriptions. We present three variations of LVP and evaluate their performance on class- and domain-incremental learning tasks. Leveraging CLIP's high-dimensional feature space, LVP learning algorithms are task-order invariant. Because new knowledge does not modify old knowledge, forgetting is minimal. Different tasks can be learned independently and in parallel with low computational and memory demands. Experimental results show that the proposed LVP-based methods outperform the current state-of-the-art baseline by a significant margin of 40.7%.
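
The idea of using training images as similarity references can be illustrated with a small sketch. It is not the authors' implementation: the `LabelVectorPool` class, the `learn` helper, and the toy 3-dimensional vectors are hypothetical stand-ins, and in practice the embeddings would come from a frozen CLIP image encoder. The sketch keeps one mean embedding per class, classifies by cosine similarity, and checks that learning two tasks in either order gives the same pool.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


class LabelVectorPool:
    """Hypothetical pool holding one mean image embedding per class."""

    def __init__(self):
        self._sums = {}    # class label -> element-wise sum of embeddings
        self._counts = {}  # class label -> number of embeddings seen

    def add(self, label, embedding):
        """Accumulate one training-image embedding for a class."""
        e = normalize(embedding)
        total = self._sums.get(label, [0.0] * len(e))
        self._sums[label] = [s + x for s, x in zip(total, e)]
        self._counts[label] = self._counts.get(label, 0) + 1

    def label_vectors(self):
        """Return the normalized class-mean embedding for every class."""
        return {
            c: normalize([s / self._counts[c] for s in total])
            for c, total in self._sums.items()
        }

    def predict(self, embedding):
        """Pick the class whose label vector is most similar to the query."""
        q = normalize(embedding)
        vectors = self.label_vectors()
        return max(vectors, key=lambda c: sum(a * b for a, b in zip(q, vectors[c])))


def learn(pool, task):
    """Add every (label, embedding) pair of a task to the pool."""
    for label, embedding in task:
        pool.add(label, embedding)


# Toy stand-ins for image-encoder outputs (assumed values, not real CLIP features).
task_a = [("cat", [1.0, 0.1, 0.0]), ("cat", [0.9, 0.2, 0.0]),
          ("dog", [0.1, 1.0, 0.0]), ("dog", [0.2, 0.9, 0.1])]
task_b = [("car", [0.0, 0.1, 1.0]), ("car", [0.1, 0.0, 0.9])]

pool_ab = LabelVectorPool()
learn(pool_ab, task_a)
learn(pool_ab, task_b)

pool_ba = LabelVectorPool()
learn(pool_ba, task_b)
learn(pool_ba, task_a)

print(pool_ab.predict([0.95, 0.15, 0.05]))                 # cat
print(pool_ab.label_vectors() == pool_ba.label_vectors())  # True
```

Because each class entry depends only on that class's own images, adding a new task never touches existing entries, which is the property the abstract relies on for order invariance and minimal forgetting.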

Paper Structure

This paper contains 24 sections, 9 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Comparison with traditional CLIP-based approaches. While traditional methods compare similarity between the encoded test image and text labels, our approach evaluates similarity between image embeddings directly and makes the text encoder play an auxiliary role when possible.
  • Figure 2: The hypothesis is that embeddings in the same modality should be more similar to each other. The training image embedding $\tilde{I}_1$ is expected to be more similar to the test image embedding $\hat{I}_1$ than the text embeddings $T_1$ and $T_2$ are, i.e., $\langle \hat{I}_1,\tilde{I}_1\rangle > \langle \hat{I}_1,T_1\rangle$ and $\langle\hat{I}_1,\tilde{I}_1\rangle > \langle \hat{I}_1,T_2\rangle$.
  • Figure 3: Framework of LVP-CLIP. First, the concept of LVP is demonstrated. Second, three realizations of LVP are shown: LVP-T, LVP-I, and LVP-IT. LVP-T, known as zero-shot, is the LVP generated from the text encoder. Our proposed LVP-I is the mean of the image embeddings of each class in the training set. LVP-IT is obtained as a combination of LVP-T and LVP-I with task-specific trainable parameters $\alpha,\beta$ for each class. In addition, LVP-C is a classifier optimized on LVP.
  • Figure 4: Distributions of the same feature across different classes in CIFAR100 training set. We examine the first 4 feature distributions for 3 classes, denoted as $E^k_i,i\in[1,4],k\in[1,3]$. Panels (a) to (d) show the distributions of $E^k_1$ to $E^k_4$ for these three classes. The vertical lines represent the mean values of each distribution. As shown, all features approximately follow a Gaussian distribution, with different combinations of means across different classes.
  • Figure 5: Images and labels from the CORe50 dataset. There are 50 classes in total but only 10 object names. Each object has five different instances, each treated as a separate class. Since the class names are very close to each other as text, it is nearly impossible to separate them by zero-shot learning.
  • ...and 2 more figures
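
The Figure 3 caption describes LVP-IT as a combination of LVP-T and LVP-I with per-class parameters $\alpha,\beta$, and the Figure 5 caption explains why near-identical class names defeat text labels. The sketch below illustrates both points under stated assumptions: the linear form `alpha * T + beta * I`, the `blend` and `classify` helpers, the hand-set weights, and the toy vectors are all hypothetical, whereas in the paper $\alpha,\beta$ are trainable.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def blend(text_vec, image_vec, alpha, beta):
    """Assumed linear blend of a text label vector and an image label vector."""
    t, i = normalize(text_vec), normalize(image_vec)
    return normalize([alpha * x + beta * y for x, y in zip(t, i)])


def classify(query, label_vectors):
    """Return the class whose label vector is most similar to the query."""
    return max(label_vectors, key=lambda c: cosine(query, label_vectors[c]))


# Toy vectors (assumed): two instances of one object have almost the same
# text embedding, while their mean image embeddings are clearly different.
text = {"cup1": [1.0, 0.0, 0.0], "cup2": [0.99, 0.01, 0.0]}
image = {"cup1": [0.6, 0.8, 0.0], "cup2": [0.6, 0.0, 0.8]}

query = [0.55, 0.0, 0.85]  # test-image embedding that belongs to cup2

text_only = {c: blend(text[c], image[c], 1.0, 0.0) for c in text}   # LVP-T-like
image_only = {c: blend(text[c], image[c], 0.0, 1.0) for c in text}  # LVP-I-like
mixed = {c: blend(text[c], image[c], 0.3, 0.7) for c in text}       # LVP-IT-like

print(classify(query, text_only))   # cup1 (wrong; the text vectors barely differ)
print(classify(query, image_only))  # cup2
print(classify(query, mixed))       # cup2
```

Setting `alpha=1, beta=0` or `alpha=0, beta=1` recovers the text-only and image-only label vectors, so under this assumed form the blend covers both as special cases.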