LVP-CLIP: Revisiting CLIP for Continual Learning with Label Vector Pool
Yue Ma, Huantao Ren, Boyu Wang, Jingang Jin, Senem Velipasalar, Qinru Qiu
TL;DR
This paper introduces the Label Vector Pool (LVP), a framework for continual learning that replaces text-based class labels with reference image embeddings stored in a pool, enabling CLIP-based models to learn new tasks without forgetting. It presents three practical variants (LVP-I, LVP-IT, and LVP-C) that trade off memory, computation, and accuracy by using, respectively, image embeddings alone, a text-image blend, or a learned classifier on LVP vectors. Across class-, domain-, and cross-task incremental settings (including a large CTIL benchmark with 595 classes), LVP-CLIP outperforms the compared baselines and significantly reduces forgetting, while keeping memory overhead low and allowing tasks to be learned in parallel. The work shows that high-dimensional embedding spaces and modality-agnostic labeling can yield substantial gains in continual learning, with practical benefits for scalability and privacy. Overall, LVP-CLIP provides a robust, efficient alternative to prompt-based and rehearsal-heavy continual learning approaches, which should ease real-world deployment.
Abstract
Continual learning aims to update a model so that it can sequentially learn new tasks without forgetting previously acquired knowledge. Recent continual learning approaches often leverage the vision-language model CLIP for its high-dimensional feature space and cross-modality feature matching. Traditional CLIP-based classification methods identify the most similar text label for a test image by comparing their embeddings. However, these methods are sensitive to the quality of the text phrases and are less effective for classes that lack meaningful text labels. In this work, we rethink CLIP-based continual learning and introduce the concept of the Label Vector Pool (LVP). LVP replaces text labels with training images as similarity references, eliminating the need for ideal text descriptions. We present three variations of LVP and evaluate their performance on class- and domain-incremental learning tasks. Leveraging CLIP's high-dimensional feature space, LVP learning algorithms are task-order invariant. New knowledge does not modify old knowledge; hence, forgetting is minimal. Different tasks can be learned independently and in parallel with low computational and memory demands. Experimental results show that the proposed LVP-based methods outperform the current state-of-the-art baseline by a significant margin of 40.7%.
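
The core idea in the abstract, using image embeddings rather than text embeddings as similarity references, can be illustrated with a minimal sketch. This is not the authors' implementation: the `LabelVectorPool` class, the choice of a per-class mean of normalized embeddings as the reference vector, and the `fake_embeddings` helper (a stand-in for a frozen CLIP image encoder) are all assumptions made for illustration. The sketch also shows why such a pool is task-order invariant: each class writes only its own entry, so adding classes in a different order yields the same pool.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class LabelVectorPool:
    """Hypothetical pool holding one reference vector per class."""

    def __init__(self):
        self.vectors = {}  # class name -> unit-length reference vector

    def add_class(self, name, image_embeddings):
        # Assumption: the reference vector is the mean of the normalized
        # training-image embeddings. Only this class's entry is written,
        # so previously stored classes are never modified.
        emb = l2_normalize(np.asarray(image_embeddings, dtype=np.float64))
        self.vectors[name] = l2_normalize(emb.mean(axis=0))

    def predict(self, query_embedding):
        # Return the class whose reference vector is most similar to the query.
        names = sorted(self.vectors)
        refs = np.stack([self.vectors[n] for n in names])
        query = l2_normalize(np.asarray(query_embedding, dtype=np.float64))
        return names[int(np.argmax(refs @ query))]


def fake_embeddings(rng, center, n=20, noise=0.02):
    """Stand-in for a CLIP image encoder: noisy samples around a class direction."""
    return center + noise * rng.standard_normal((n, center.shape[0]))


rng = np.random.default_rng(0)
dim = 512
class_names = ["cat", "dog", "car", "tree"]
centers = {c: l2_normalize(rng.standard_normal(dim)) for c in class_names}
train = {c: fake_embeddings(rng, centers[c]) for c in class_names}

# Learn the same classes in two different orders.
pool_forward = LabelVectorPool()
for c in class_names:
    pool_forward.add_class(c, train[c])

pool_reversed = LabelVectorPool()
for c in reversed(class_names):
    pool_reversed.add_class(c, train[c])

# Classify a held-out "dog" sample with both pools.
query = fake_embeddings(rng, centers["dog"], n=1)[0]
prediction_forward = pool_forward.predict(query)
prediction_reversed = pool_reversed.predict(query)
same_pool = all(
    np.allclose(pool_forward.vectors[c], pool_reversed.vectors[c])
    for c in class_names
)
```

In a real pipeline the synthetic vectors would be replaced by embeddings from a frozen CLIP image encoder, and a text-image blend or a learned classifier would need additional machinery not shown here.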
