CoLeCLIP: Open-Domain Continual Learning via Joint Task Prompt and Vocabulary Learning

Yukun Li, Guansong Pang, Wei Suo, Chenchen Jing, Yuling Xi, Lingqiao Liu, Hao Chen, Guoqiang Liang, Peng Wang

TL;DR

CoLeCLIP is a novel approach that learns an open-domain CL model based on CLIP and achieves new state-of-the-art performance for open-domain CL under both task-incremental learning (TIL) and class-incremental learning (CIL) settings.

Abstract

This paper explores the problem of continual learning (CL) of vision-language models (VLMs) in open domains, where the models need to perform continual updating and inference on a stream of datasets from diverse seen and unseen domains with novel classes. Such a capability is crucial for various applications in open environments, e.g., AI assistants, autonomous driving systems, and robotics. Current CL studies mostly focus on closed-set scenarios in a single domain with known classes. Large pre-trained VLMs like CLIP have demonstrated superior zero-shot recognition ability, and a number of recent studies leverage this ability to mitigate catastrophic forgetting in CL, but they focus on closed-set CL on a single-domain dataset. Open-domain CL of large VLMs is significantly more challenging due to 1) large class correlations and domain gaps across the datasets and 2) the forgetting of zero-shot knowledge in the pre-trained VLMs in addition to the knowledge learned from the newly adapted datasets. In this work, we introduce a novel approach, termed CoLeCLIP, that learns an open-domain CL model based on CLIP. It addresses these challenges by jointly learning a set of task prompts and a cross-domain class vocabulary. Extensive experiments on 11 domain datasets show that CoLeCLIP outperforms state-of-the-art methods for open-domain CL under both task- and class-incremental learning settings.

Paper Structure (16 sections, 8 equations, 2 figures, 10 tables)

Figures (2)

  • Figure 1: Illustration of open-domain continual learning (ODCL), in which VLMs continuously acquire new knowledge from different datasets in incoming tasks, where different tasks may exhibit large domain gaps and strong class correlations. After training on the $t$-th task, the VLM is evaluated on the datasets from all domains under task-incremental learning (TIL) or class-incremental learning (CIL).
  • Figure 2: Illustration of the training pipeline of CoLeCLIP at a time step $t$ of the CL task. For vocabulary learning, CoLeCLIP attaches an additional parameter-efficient fine-tuning (PEFT) module to the text encoder of CLIP to refine the text embeddings of the classes per task, and builds a class vocabulary with momentum updates. For task prompt learning, CoLeCLIP uses a task prompt for each task to extract domain-specific patterns. In the training phase, the text encoder and image encoder of CLIP are kept frozen, and only the PEFT module and the task prompt are fine-tuned. For the purpose of illustration, we omit the subscripts representing the layer number and use $g_{y}$ instead of $g_{t}(y)$.
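The Figure 2 caption mentions a class vocabulary built with momentum updates but does not state the update rule. As a rough illustration only, the sketch below assumes a plain exponential moving average over L2-normalized text embeddings. The function names (`l2_normalize`, `momentum_update`), the momentum value of 0.9, and the re-normalization step are all hypothetical choices made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def momentum_update(vocabulary, class_name, new_embedding, momentum=0.9):
    """Blend a refined text embedding into the stored vocabulary entry.

    Assumed rule (not from the paper): a class seen for the first time is
    stored directly; a class already in the vocabulary is updated as
    momentum * old + (1 - momentum) * new, then re-normalized.
    """
    new_embedding = l2_normalize(new_embedding)
    if class_name not in vocabulary:
        vocabulary[class_name] = new_embedding
    else:
        old = vocabulary[class_name]
        blended = [momentum * o + (1.0 - momentum) * n
                   for o, n in zip(old, new_embedding)]
        vocabulary[class_name] = l2_normalize(blended)
    return vocabulary[class_name]


# Toy usage: the class "cat" appears in two tasks with different embeddings.
vocabulary = {}
momentum_update(vocabulary, "cat", [1.0, 0.0])
updated = momentum_update(vocabulary, "cat", [0.0, 1.0], momentum=0.9)
```

With a high momentum, the stored entry moves only slightly toward the newer embedding, which is one way a shared vocabulary could stay stable when the same class reappears in a later task.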