Learning without Forgetting for Vision-Language Models

Da-Wei Zhou; Yuanhan Zhang; Yan Wang; Jingyi Ning; Han-Jia Ye; De-Chuan Zhan; Ziwei Liu

TL;DR

This work addresses the challenge of enabling Vision-Language Models (VLMs) to perform Class-Incremental Learning (CIL) without catastrophic forgetting while leveraging cross-modal information. It introduces PROOF, a framework that freezes the pre-trained encoders, adds task-specific expandable projections, and uses a cross-modal fusion module to contextualize features via visual prototypes, textual prototypes, and learnable context prompts. PROOF achieves state-of-the-art results across nine benchmark datasets and extends to cross-modal retrieval and non-overlapping data scenarios, with ablations confirming the benefits of projection expansion and fusion. The approach adds few parameters per incremental task and preserves zero-shot capabilities.

Abstract

Class-Incremental Learning (CIL), or continual learning, is a desired capability in the real world: it requires a learning system to adapt to new tasks without forgetting former ones. While traditional CIL methods focus on visual information to grasp core features, recent advances in Vision-Language Models (VLMs) have shown promising capabilities in learning generalizable representations with the aid of textual information. However, when continually trained with new classes, VLMs often suffer from catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to adapt the model without forgetting; and 2) how to make full use of the multi-modal information. To this end, we propose PROjectiOn Fusion (PROOF), which enables VLMs to learn without forgetting. To handle the first challenge, we propose training task-specific projections on top of the frozen image/text encoders. When facing new tasks, new projections are expanded and former projections are fixed, alleviating the forgetting of old concepts. For the second challenge, we propose a fusion module to better utilize cross-modal information. By jointly adjusting visual and textual features, the model can capture semantic information with stronger representation ability. Extensive experiments on nine benchmark datasets validate that PROOF achieves state-of-the-art performance. Code is available at https://github.com/zhoudw-zdw/PROOF
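
The expand-and-freeze idea in the abstract can be illustrated with a small sketch. It is not taken from the PROOF codebase: the class name `ExpandableProjection`, the `expand` method, the random initialization, and the choice to aggregate projections by summation are all assumptions made for illustration. Pure Python is used so that the example runs without dependencies; training itself is omitted.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class ExpandableProjection:
    """Hypothetical container: one linear projection per task, applied to
    features that come from a frozen encoder."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.projections = []  # one dim x dim matrix per task
        self.trainable = []    # parallel flags; only the newest is trainable

    def expand(self):
        """Freeze every existing projection and append a new trainable one."""
        self.trainable = [False] * len(self.projections)
        W = [[self.rng.gauss(0.0, 0.02) for _ in range(self.dim)]
             for _ in range(self.dim)]
        self.projections.append(W)
        self.trainable.append(True)

    def __call__(self, feature):
        """Aggregate by summing the outputs of all projections
        (summation is an assumed aggregation rule)."""
        out = [0.0] * self.dim
        for W in self.projections:
            out = [o + p for o, p in zip(out, matvec(W, feature))]
        return out


proj = ExpandableProjection(dim=4)
proj.expand()                                   # task 1 arrives
old = [row[:] for row in proj.projections[0]]   # snapshot of task-1 weights
proj.expand()                                   # task 2: task-1 projection is frozen
feature = [1.0, 0.0, -1.0, 0.5]                 # stand-in for a frozen-encoder feature
aggregated = proj(feature)
```

In this sketch, an optimizer would update only the projection whose `trainable` flag is `True`, so the parameters learned for earlier tasks stay untouched while the aggregated feature still draws on all of them.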

Paper Structure (28 sections, 12 equations, 8 figures, 3 tables, 1 algorithm)

This paper contains 28 sections, 12 equations, 8 figures, 3 tables, 1 algorithm.

Introduction
Related Work
Vision-Language Model (VLM) Tuning
Class-Incremental Learning (CIL)
CIL with VLM
Preliminaries
Class-Incremental Learning
Vision-Language Model
Overcome Forgetting in Class-Incremental Learning
PROOF: Projection Fusion for VLM
Expandable Feature Projection
Contextualizing Projections with Projection Fusion
Summary of PROOF
Experiment
Experimental Setup
...and 13 more sections

Figures (8)

Figure 1: Illustration of PROOF. The model learns expandable projections and aggregates them to get the aggregated features. The query instance, prototype features, textual features, and context prompts are fed into the cross-modal fusion module. The fusion process utilizes self-attention to co-adapt the input set, which outputs the adapted features. The adapted query embedding is separately matched among the visual prototypes and textual features to get the final prediction. Red parts are trainable while gray ones are frozen.
Figure 2: Incremental performance of different methods. We report the performance gap after the last incremental stage between PROOF and the runner-up method at the end of the line. All methods are based on the same backbone/weight.
Figure 3: Incremental performance of different methods with large base classes. We report the performance gap after the last incremental stage between PROOF and the runner-up method at the end of the line. All methods are based on the same backbone/weight.
Figure 4: Ablation study. Left: experiments on nine benchmarks with OpenAI weights. Middle: ablation study on compositional components in PROOF. Every part improves the performance of CIL. Right: $\mathcal{A}_B$ and $\bar{\mathcal{A}}$ as the context prompt length varies. The performance is robust to the change of context prompt length.
Figure 5: Left: Variations of context information. The choice of using visual prototypes, textual prototypes, and context prompts as the context information achieves the best performance. Middle: Variations of projection layers. The choice of using a single linear layer as the projection layer achieves the best performance. Right: Number of parameters in different methods. The shaded area represents the parameters used during training but dropped during inference.
...and 3 more figures
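
The fusion step described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the self-attention has a single head with no learned query/key/value weights, a residual connection is added, and the two matching scores (against visual prototypes and against textual features) are combined by summing cosine similarities.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with a residual
    connection. Learned projections are omitted (simplifying assumption)."""
    scale = math.sqrt(len(tokens[0]))
    adapted = []
    for t in tokens:
        weights = softmax([dot(t, u) / scale for u in tokens])
        mixed = [sum(w * u[i] for w, u in zip(weights, tokens))
                 for i in range(len(t))]
        adapted.append([a + b for a, b in zip(t, mixed)])
    return adapted


def fuse_and_predict(query, visual_protos, text_feats, context_prompts):
    """Co-adapt the whole input set, then match the adapted query separately
    against adapted visual prototypes and adapted textual features."""
    n = len(visual_protos)
    tokens = [query] + visual_protos + text_feats + context_prompts
    adapted = self_attention(tokens)
    q = adapted[0]
    protos = adapted[1:1 + n]
    texts = adapted[1 + n:1 + 2 * n]
    # Context prompts only supply context; they are not matching targets.
    logits = [cosine(q, p) + cosine(q, t) for p, t in zip(protos, texts)]
    return max(range(n), key=logits.__getitem__), logits


# Toy two-class example with 4-dimensional features (made-up values).
visual_protos = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
text_feats = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]]
context_prompts = [[0.0, 0.0, 0.1, 0.1]]
query = [0.8, 0.2, 0.0, 0.0]
predicted, logits = fuse_and_predict(query, visual_protos, text_feats, context_prompts)
```

Because every token attends to every other token, the query, the prototypes, and the textual features are adjusted jointly before matching, which is the co-adaptation the caption refers to. In the toy example the query lies closest to the first class, so `predicted` is `0`.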
