DesCLIP: Robust Continual Learning via General Attribute Descriptions for VLM-Based Visual Recognition

Chiyuan He, Zihuan Qiu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li

Abstract

Continual learning of vision-language models (VLMs) focuses on leveraging cross-modal pretrained knowledge to incrementally adapt to expanding downstream tasks and datasets, while tackling the challenge of knowledge forgetting. Existing research often focuses on connecting visual features with specific class text in downstream tasks, overlooking the latent relationships between general and specialized knowledge. Our findings reveal that forcing models to optimize inappropriate visual-text matches exacerbates forgetting of the VLM's recognition ability. To tackle this issue, we propose DesCLIP, which leverages general attribute (GA) descriptions to guide the understanding of specific class objects, enabling VLMs to establish robust vision-GA-class trilateral associations rather than relying solely on vision-class connections. Specifically, we introduce a language assistant to generate concrete GA description candidates via suitable request prompts. Then, an anchor-based embedding filter is designed to obtain highly relevant GA description embeddings, which serve as the paired text embeddings for visual-textual instance matching, thereby tuning the visual encoder. Correspondingly, the class text embeddings are gradually calibrated to align with these shared GA description embeddings. Extensive experiments demonstrate the efficacy of the proposed method, with comprehensive empirical evaluations highlighting its superior performance in VLM-based recognition compared to existing continual learning methods.

Paper Structure

This paper contains 22 sections, 22 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: (a) Existing methods: learning to match unfamiliar specific classes leads to a risk of forgetting. (b) Ours: learning to construct connections with highly relevant general attributes and gradually calibrate class-text embeddings, thereby significantly mitigating the forgetting caused by fitting unfamiliar class knowledge in downstream continual tasks. (c) The maintained low loss on the reference validation set demonstrates that our approach preserves the substantial potential of CL by retaining the pretrained general knowledge, thereby achieving optimal performance, compared to SL [zhang2023slca] and SPU [zhang2024overcoming].
  • Figure 2: (a) Continual fine-tuning with different task orders: familiar-first order and unfamiliar-first order. (b) Initially matching unfamiliar class texts leads to more severe forgetting, which negatively impacts the learning of subsequent tasks, resulting in overall poorer CL performance. (c) CKA [kornblith2019similarity] similarities of representations compared to a pretrained CLIP. Tasks 1-5 are ordered from familiar to unfamiliar. Learning to match unfamiliar classes further disrupts the integrity of pretrained representations.
  • Figure 3: Overview of our proposed DesCLIP. At each task $t$, a language assistant is requested to generate sufficient general attribute description candidates for the classes in the current task, which are then encoded into embeddings via CLIP's textual encoder. Using the anchor-based embedding filter (AEF), we filter the candidate embeddings by selecting those highly relevant to the visual features of the instances. The filtered embeddings are paired with the instance visual features to compute a class-agnostic instance matching loss. Correspondingly, class text embeddings are calibrated through shift weights to align with these shared filtered embeddings.
  • Figure 4: DRP-guided GA description generation with a language assistant.
  • Figure 5: Per-stage average accuracies on OOD datasets.
  • ...and 8 more figures
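
The filtering and matching steps described in the Figure 3 caption can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the captions do not define the anchor, the filtering rule, or the loss. The sketch assumes that the anchor is the instance's visual feature, that filtering keeps the `top_k` candidates by cosine similarity, that the kept embeddings are averaged into one paired text embedding, and that the instance matching loss is an image-to-text contrastive (InfoNCE-style) loss over the batch. The function names, `top_k`, and `temperature` are hypothetical, and random vectors stand in for CLIP features. The shift-weight calibration of class text embeddings is not covered.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def filter_candidates(visual_feat, candidate_embs, top_k=3):
    """Keep the top_k candidate embeddings closest to the anchor.

    Assumption: the anchor is the instance's visual feature and relevance
    is cosine similarity; the paper's actual filter may differ.
    """
    v = l2_normalize(visual_feat)        # (dim,)
    c = l2_normalize(candidate_embs)     # (n_candidates, dim)
    sims = c @ v                         # cosine similarity per candidate
    keep = np.argsort(-sims)[:top_k]     # indices of the most similar ones
    return c[keep], keep


def instance_matching_loss(visual_feats, paired_text_embs, temperature=0.07):
    """Image-to-text contrastive loss over a batch (assumed form).

    Row i of visual_feats is matched with row i of paired_text_embs; all
    other rows in the batch act as negatives, so no class labels are used.
    """
    v = l2_normalize(visual_feats)
    t = l2_normalize(paired_text_embs)
    logits = (v @ t.T) / temperature                  # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Toy data standing in for CLIP visual features and description embeddings.
rng = np.random.default_rng(0)
dim, batch, n_cand = 16, 4, 8
visual = rng.normal(size=(batch, dim))
candidates = rng.normal(size=(n_cand, dim))

# Assumption: the filtered embeddings are averaged into one paired embedding.
paired = np.stack(
    [filter_candidates(v, candidates, top_k=3)[0].mean(axis=0) for v in visual]
)
print("instance matching loss:", instance_matching_loss(visual, paired))
```

Because the loss contrasts instances within the batch rather than class labels, it is class-agnostic in the sense used by the caption. In practice the features would come from the CLIP encoders, and the gradient would flow into the visual encoder.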