AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning

Runqi Wang, Xiaoyue Duan, Guoliang Kang, Jianzhuang Liu, Shaohui Lin, Songcen Xu, Jinhu Lv, Baochang Zhang

TL;DR

This paper proposes AttriCLIP, a non-incremental learner built upon the pre-trained visual-language model CLIP that incrementally extracts knowledge of new classes or tasks. It is compared with CLIP-based and previous state-of-the-art continual learning methods in realistic settings with domain shift and long-sequence learning.

Abstract

Continual learning aims to enable a model to incrementally learn knowledge from sequentially arriving data. Previous works adopt the conventional classification architecture, which consists of a feature extractor and a classifier. The feature extractor is shared across sequentially arriving tasks or classes, but the classifier must be expanded with a specific group of weights for each new class. Consequently, the number of parameters of a continual learner gradually increases. Moreover, as the classifier covers all previously seen classes, a memory of a certain size is usually required to store rehearsal data to mitigate classifier bias and catastrophic forgetting. In this paper, we propose a non-incremental learner, named AttriCLIP, to incrementally extract knowledge of new classes or tasks. Specifically, AttriCLIP is built upon the pre-trained visual-language model CLIP. Its image encoder and text encoder are fixed and extract features from images and text, respectively. The text consists of a category name and a fixed number of learnable parameters which are selected from our designed attribute word bank and serve as attributes. Because classification is performed by computing the similarity between visual and textual features, no classifier weights need to be added for new classes, so AttriCLIP is a non-incremental learner. The attribute prompts, which encode the common knowledge useful for classification, can effectively mitigate catastrophic forgetting and avoid constructing a replay memory. We evaluate AttriCLIP and compare it with CLIP-based and previous state-of-the-art continual learning methods in realistic settings with domain shift and long-sequence learning. The results show that our method performs favorably against previous state-of-the-art methods. The implementation code is available at https://github.com/bhrqw/AttriCLIP.
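
The selection-and-similarity idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `select_prompts` and `classify`, the use of cosine similarity for key selection, and the toy three-dimensional vectors are all assumptions made for illustration, and the text features are supplied by hand rather than produced by CLIP's frozen text encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_prompts(image_feature, keys, top_c):
    """Return the indices of the top_c bank keys most similar to the image.

    Hypothetical helper: the abstract only says prompts are selected from
    an attribute word bank; ranking keys by cosine similarity is assumed.
    """
    scores = [cosine(image_feature, k) for k in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return order[:top_c]

def classify(image_feature, text_features):
    """Pick the class whose text feature is most similar to the image.

    text_features maps class name -> feature vector. In the real model
    these would come from the frozen text encoder applied to the selected
    prompts plus the class name; here they are toy vectors.
    """
    return max(text_features,
               key=lambda name: cosine(image_feature, text_features[name]))

# Toy data (assumed values, for illustration only).
z = [1.0, 0.0, 0.0]                      # image feature
keys = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
        [0.7, 0.0, 0.7], [-1.0, 0.0, 0.0]]
selected = select_prompts(z, keys, top_c=2)

text_features = {"cat": [0.8, 0.2, 0.0], "dog": [0.1, 0.9, 0.0]}
first_prediction = classify(z, text_features)

# A new class arrives: only a new text feature is added, no classifier head.
text_features["bird"] = [0.0, 0.1, 0.9]
second_prediction = classify(z, text_features)
bird_prediction = classify([0.0, 0.0, 1.0], text_features)
```

In this sketch, adding the class "bird" changes nothing except the set of text features being compared, which is the sense in which a similarity-based classifier avoids growing a classification head.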

Paper Structure (17 sections, 11 equations, 7 figures, 11 tables)

This paper contains 17 sections, 11 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: (a) Traditional framework for continual learning. The encoder and the classifier are trained on tasks in sequence, and some methods additionally require memory data. In this framework, the model parameters for the current task are fine-tuned from the parameters trained on the previous task and are then used to classify all seen tasks. The total number of categories the model can classify is fixed in the classifier. (b) Our proposed AttriCLIP for continual learning. AttriCLIP is based on CLIP, which classifies images by contrasting them with their descriptive texts. The trainable prompts are selected from a prompt pool according to the attributes of the current image, so images with different attributes receive different prompts. The trained prompts are concatenated with the class name of the image, and together they serve as a more accurate supervision signal for image classification than labels.
  • Figure 2: Framework of AttriCLIP. The image keys $\mathbf{k}_i$ and the textual prompts $\mathbf{P}_i$ in the attribute word bank are trainable parameters. The blue and green boxes represent the image and text streams, respectively. The attribute word bank is optimized by three loss functions. $\mathcal{L}_m$ is the classification loss adopted to maximize the similarity between the image feature $\mathbf{z}$ and the corresponding text features $\mathbf{w}$. $\mathcal{L}_k$ is designed to shorten the distance between the selected keys (e.g., $\mathbf{k}_2$ and $\mathbf{k}_n$) and the image feature $\mathbf{z}$, so that the keys learn generalizable attributes. $\mathcal{L}_p$ makes the embeddings of the prompts $g_\mathbf{\psi}(\mathbf{P}_i)$ orthogonal to increase the diversity of the prompts.
  • Figure 3: Ablation study of (a) the prompt length $M$, (b) the bank size $N$, and (c) the number of selected keys $C$ on CIFAR100.
  • Figure 4: Visualization of the prompts selected for the same image using Grad-CAM (Selvaraju et al., 2017).
  • Figure 5: Visualization of the same prompts on different images using Grad-CAM (Selvaraju et al., 2017).
  • ...and 2 more figures
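
The Figure 2 caption describes two auxiliary losses on the attribute word bank: $\mathcal{L}_k$ pulls the selected keys toward the image feature, and $\mathcal{L}_p$ pushes the prompt embeddings toward orthogonality. The sketch below is one plausible reading of those descriptions, not the paper's exact formulas. The function names `key_loss` and `orthogonality_loss`, the choice of `1 - cosine` as the distance, and the mean absolute off-diagonal cosine as the orthogonality penalty are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def key_loss(image_feature, keys, selected):
    """Assumed form of L_k: summed cosine distance between the image
    feature and each selected key. Zero when every selected key points
    in the same direction as the image feature."""
    return sum(1.0 - cosine(image_feature, keys[i]) for i in selected)

def orthogonality_loss(prompt_embeddings):
    """Assumed form of L_p: mean absolute cosine similarity over all
    ordered pairs of distinct prompt embeddings. Zero when the
    embeddings are mutually orthogonal."""
    n = len(prompt_embeddings)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += abs(cosine(prompt_embeddings[i],
                                    prompt_embeddings[j]))
    return total / (n * (n - 1))

# Toy values (assumed, for illustration only).
z = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
aligned = key_loss(z, keys, selected=[0])      # key matches the image
misaligned = key_loss(z, keys, selected=[1])   # key orthogonal to the image

diverse = orthogonality_loss([[1.0, 0.0], [0.0, 1.0]])
redundant = orthogonality_loss([[1.0, 0.0], [2.0, 0.0]])
```

Under these assumed forms, minimizing the first term moves selected keys toward the images that chose them, while minimizing the second discourages different prompts from encoding the same direction.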